Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs

📅 2025-12-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the vulnerability of large language models (LLMs) to indirect prompt injection (IPI) attacks in agent-based applications, this paper proposes Rennervate—a novel defense framework grounded in fine-grained, token-level attention feature analysis. Rennervate introduces a two-stage attention pooling mechanism that jointly aggregates attention heads and response tokens to enable precise detection and real-time blocking of covert malicious instructions. Complementing the framework, the authors release FIPI—the first open-source, fine-grained IPI benchmark dataset—designed to support reproducible evaluation. Extensive experiments across five mainstream LLMs and six benchmark datasets demonstrate that Rennervate significantly outperforms 15 commercial and academic defenses, achieving state-of-the-art detection accuracy, cross-model generalizability, and robustness against adaptive adversarial attacks.

📝 Abstract
Large Language Models (LLMs) have been integrated into many applications (e.g., web agents) to perform more sophisticated tasks. However, LLM-empowered applications are vulnerable to Indirect Prompt Injection (IPI) attacks, where instructions are injected via untrustworthy external data sources. This paper presents Rennervate, a defense framework to detect and prevent IPI attacks. Rennervate leverages attention features to detect the covert injection at a fine-grained token level, enabling precise sanitization that neutralizes IPI attacks while maintaining LLM functionalities. Specifically, the token-level detector is materialized with a 2-step attentive pooling mechanism, which aggregates attention heads and response tokens for IPI detection and sanitization. Moreover, we establish a fine-grained IPI dataset, FIPI, to be open-sourced to support further research. Extensive experiments verify that Rennervate outperforms 15 commercial and academic IPI defense methods, achieving high precision on 5 LLMs and 6 datasets. We also demonstrate that Rennervate is transferable to unseen attacks and robust against adaptive adversaries.
Problem

Research questions and friction points this paper is trying to address.

Defends against Indirect Prompt Injection attacks in LLMs
Detects covert injections at a fine-grained token level
Ensures high precision while maintaining model functionality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages attention features for fine-grained token-level detection
Uses a 2-step attentive pooling mechanism for aggregation
Establishes a fine-grained IPI dataset to support research
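The paper does not publish the exact pooling equations here, but the idea described above — aggregate attention first across heads, then across response tokens, to produce a per-input-token suspicion score — can be sketched as follows. All function and variable names (`two_step_attentive_pooling`, `head_query`, `token_query`) are illustrative assumptions, not the authors' actual implementation; the learned query vectors stand in for whatever trained parameters the real detector uses.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_step_attentive_pooling(attn, head_query, token_query):
    """Sketch of a 2-step attentive pooling over attention features.

    attn:        (H, R, N) attention weights from H heads and R response
                 tokens over N input (context) tokens; each (h, r) row
                 sums to 1 over the N input tokens.
    head_query:  (H,) learned scores for weighting heads (step 1).
    token_query: (R,) learned scores for weighting response tokens (step 2).
    Returns a per-input-token score vector of shape (N,).
    """
    # Step 1: attentive pooling across heads (convex combination).
    head_w = softmax(head_query)                          # (H,)
    pooled = np.einsum('h,hrn->rn', head_w, attn)         # (R, N)
    # Step 2: attentive pooling across response tokens.
    tok_w = softmax(token_query)                          # (R,)
    scores = np.einsum('r,rn->n', tok_w, pooled)          # (N,)
    return scores

# Toy example: 4 heads, 3 response tokens, 6 input tokens.
rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(4, 3, 6)), axis=-1)
scores = two_step_attentive_pooling(
    attn, rng.normal(size=4), rng.normal(size=3))
# Hypothetical detection rule: flag input tokens whose pooled attention
# mass is an outlier; flagged spans could then be sanitized.
suspicious = scores > scores.mean() + 2 * scores.std()
```

Because both steps are convex combinations of rows that each sum to 1, the output scores also sum to 1 over the input tokens, so they behave like a pooled attention distribution; a token-level detector could threshold or classify this distribution to localize injected instructions.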