Defending against Indirect Prompt Injection by Instruction Detection

📅 2025-05-08

📈 Citations: 0

✨ Influential: 0

career value

185K/year

🤖 AI Summary

To address implicit prompt injection (IPI)—a critical threat causing LLM behavioral hijacking in external-data-augmented scenarios such as RAG—this paper proposes the first real-time IPI detection method grounded in internal model dynamics. Our approach innovatively jointly models multi-layer hidden states during forward propagation and gradient variations during backward propagation to characterize instruction-triggered behavioral state shifts, enabling cross-layer feature fusion and instruction-behavior discriminative learning. Crucially, it requires no prompt engineering or external annotations, ensuring strong generalizability and robustness. Evaluated on in-domain and out-of-domain benchmarks, our method achieves 99.60% and 96.90% detection accuracy, respectively. Under the BIPIA benchmark, it reduces attack success rate to just 0.12%, substantially outperforming existing methods. This work establishes a foundational, model-intrinsic paradigm for secure, adaptive IPI defense in production LLM systems.

Technology Category

Application Category

📝 Abstract

The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerabilities of Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that the success of IPI attacks fundamentally relies in the presence of instructions embedded within external content, which can alter the behavioral state of LLMs. Can effectively detecting such state changes help us defend against IPI attacks? In this paper, we propose a novel approach that takes external data as input and leverages the behavioral state of LLMs during both forward and backward propagation to detect potential IPI attacks. Specifically, we demonstrate that the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, our approach achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, while reducing the attack success rate to just 0.12% on the BIPIA benchmark.

Problem

Research questions and friction points this paper is trying to address.

Detects hidden instructions in external data to prevent LLM manipulation

Uses LLM behavioral states and gradients for instruction detection

Reduces Indirect Prompt Injection attack success rate significantly

Innovation

Methods, ideas, or system contributions that make the work stand out.

Detects IPI attacks using LLM behavioral states

Utilizes hidden states and gradients as features

Achieves high accuracy in instruction detection

🔎 Similar Papers

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models