AI Summary
To address indirect prompt injection (IPI), a critical threat that hijacks LLM behavior in external-data-augmented scenarios such as RAG, this paper proposes the first real-time IPI detection method grounded in internal model dynamics. The approach jointly models multi-layer hidden states from forward propagation and gradients from backward propagation to characterize the behavioral state shifts that embedded instructions trigger. Fusing these features across layers lets the detector learn to separate instruction-bearing inputs from clean ones. It requires no prompt engineering or external annotations, which supports generalizability and robustness. On in-domain and out-of-domain benchmarks, the method achieves 99.60% and 96.90% detection accuracy, respectively. On the BIPIA benchmark, it reduces the attack success rate to 0.12%, substantially outperforming existing methods. This work establishes a model-intrinsic paradigm for IPI defense in production LLM systems.
Abstract
The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerability to Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that the success of IPI attacks fundamentally relies on the presence of instructions embedded within external content, which can alter the behavioral state of LLMs. Can effectively detecting such state changes help us defend against IPI attacks? In this paper, we propose a novel approach that takes external data as input and leverages the behavioral state of LLMs during both forward and backward propagation to detect potential IPI attacks. Specifically, we demonstrate that the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, our approach achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, while reducing the attack success rate to just 0.12% on the BIPIA benchmark.
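The abstract does not spell out how the hidden-state and gradient features are extracted or combined. As a rough sketch only, the Python below assumes per-layer feature vectors have already been pulled from the model and shows one simple way they could be fused by concatenation and fed to a binary instruction detector. The toy numbers are made up for illustration, the names `fuse_features`, `train_logistic`, and `predict` are hypothetical, and the logistic regression is a stand-in classifier rather than the paper's.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_features(hidden_states: Sequence[Sequence[float]],
                  gradients: Sequence[Sequence[float]]) -> Vector:
    """Concatenate normalized per-layer hidden-state and gradient vectors.

    Assumption: each inner sequence is the feature vector of one selected
    intermediate layer. How those vectors are obtained is outside this sketch.
    """
    fused: Vector = []
    for vec in list(hidden_states) + list(gradients):
        fused.extend(l2_normalize(vec))
    return fused


def train_logistic(X: Sequence[Vector], y: Sequence[int],
                   lr: float = 0.5, epochs: int = 200) -> Tuple[Vector, float]:
    """Fit a logistic-regression detector with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log loss with respect to z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    """Return 1 if the fused features look instruction-bearing, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


# Made-up toy data: (hidden states for 2 layers, gradients for 2 layers, label).
# Label 1 = external text contains an instruction, 0 = clean text.
toy_samples = [
    ([[1.0, 0.1], [0.9, 0.2]], [[0.9, 0.2], [1.0, 0.1]], 1),
    ([[0.8, 0.2], [1.0, 0.1]], [[1.0, 0.1], [0.9, 0.3]], 1),
    ([[0.1, 1.0], [0.2, 0.8]], [[0.2, 0.8], [0.1, 1.0]], 0),
    ([[0.2, 0.9], [0.1, 1.0]], [[0.1, 1.0], [0.3, 0.9]], 0),
]

X = [fuse_features(h, g) for h, g, _ in toy_samples]
y = [label for _, _, label in toy_samples]
w, b = train_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
```

In a deployment along these lines, external content flagged with label 1 would be withheld or sanitized before reaching the LLM. That gating step is also an assumption here, not something the abstract specifies.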