PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-context large language models (LLMs) are vulnerable to prompt injection attacks, yet existing defenses fail in long contexts, where malicious instructions constitute a negligible fraction of the tokens. This paper proposes a prompt sanitization–based defense framework that, for the first time, leverages the attention mechanism to localize and remove injected malicious tokens before response generation, thereby blocking attack propagation. Its core innovation is identifying and exploiting a defender-favorable dilemma: the more effectively an injected instruction compels the model to follow it, the more likely its tokens are to be removed. By integrating attention analysis, instruction-following behavior detection, and context-sensitive token sanitization, the method achieves high-precision identification and elimination of injected prompts. Experiments demonstrate that, while preserving normal model functionality, the approach significantly outperforms state-of-the-art defenses against both optimization-based and strong adaptive prompt injection attacks.

📝 Abstract
Long-context LLMs are vulnerable to prompt injection, where an attacker injects an instruction into a long context to induce an LLM to generate an attacker-desired output. Existing prompt injection defenses are designed for short contexts and have limited effectiveness when extended to long-context scenarios, because an injected instruction constitutes only a very small portion of a long context, making defense very challenging. In this work, we propose PISanitizer, which first pinpoints and sanitizes potential injected tokens (if any) in a context before letting a backend LLM generate a response, thereby eliminating the influence of the injected instruction. To sanitize injected tokens, PISanitizer builds on two observations: (1) prompt injection attacks essentially craft an instruction that compels an LLM to follow it, and (2) LLMs intrinsically leverage the attention mechanism to focus on crucial input tokens for output generation. Guided by these two observations, we first intentionally let an LLM follow arbitrary instructions in a context and then sanitize tokens receiving high attention that drive the instruction-following behavior of the LLM. By design, PISanitizer presents a dilemma for an attacker: the more effectively an injected instruction compels an LLM to follow it, the more likely it is to be sanitized by PISanitizer. Our extensive evaluation shows that PISanitizer successfully prevents prompt injection, maintains utility, outperforms existing defenses, is efficient, and is robust to optimization-based and strong adaptive attacks. The code is available at https://github.com/sleeepeer/PISanitizer.
Problem

Research questions and friction points this paper is trying to address.

Preventing prompt injection attacks in long-context large language models
Addressing limited effectiveness of existing defenses in long-context scenarios
Sanitizing injected tokens that compel LLMs to follow malicious instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sanitizes injected tokens before LLM response generation
Uses attention mechanism to identify malicious instruction tokens
Creates dilemma where effective attacks become more detectable
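The core idea above, dropping tokens that receive disproportionately high attention during instruction following, can be sketched in a few lines. This is a toy illustration, not PISanitizer's actual implementation: the function name `sanitize_context`, the threshold value, and the hand-picked attention scores are all hypothetical, and a real system would obtain per-token scores from the model's attention maps rather than a fixed list.

```python
def sanitize_context(tokens, attention_scores, threshold=0.8):
    """Drop tokens whose attention score (normalized to [0, 1]) exceeds
    the threshold, i.e. the tokens that most strongly drive the model's
    instruction-following behavior. Illustrative sketch only."""
    if len(tokens) != len(attention_scores):
        raise ValueError("tokens and attention_scores must align")
    # Keep only low-attention tokens; high-attention spans are treated
    # as candidate injected instructions and removed.
    return [tok for tok, score in zip(tokens, attention_scores)
            if score < threshold]

# Toy example: the injected instruction draws the most attention.
tokens = ["The", "report", "says", "...", "IGNORE", "PREVIOUS", "INSTRUCTIONS"]
scores = [0.05, 0.10, 0.08, 0.02, 0.95, 0.93, 0.97]
print(sanitize_context(tokens, scores))
# -> ['The', 'report', 'says', '...']
```

This also makes the attacker's dilemma concrete: pushing the injected instruction's attention scores higher makes the attack more effective on an undefended model, but pushes those same tokens above the sanitization threshold.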