🤖 AI Summary
Prompt injection attacks pose a significant security threat to large language model (LLM) applications, yet existing detection methods suffer from low accuracy or high computational overhead. To address this, we propose a lightweight and efficient detection method based on internal representation analysis. Our key insight is the identification of a critical hidden layer within LLMs that exhibits high sensitivity to prompt injection: the final-token representations at this layer show statistically separable distributions between benign and malicious prompts. Leveraging this property, our method extracts only that layer's final-token representation and trains a lightweight linear classifier, requiring minimal computation and no model fine-tuning. Evaluated across five benchmark datasets and eight prevalent attack types, our approach consistently outperforms eleven state-of-the-art baselines in detection accuracy while maintaining negligible inference overhead. It further demonstrates strong generalization to unseen attack variants and robustness against adaptive adversarial attacks.
📝 Abstract
LLM-integrated applications are vulnerable to prompt injection attacks, where an attacker contaminates the input to inject malicious prompts, causing the LLM to follow the attacker's intent instead of the original user's. Existing prompt injection detection methods often have sub-optimal performance and/or high computational overhead. In this work, we propose PIShield, a detection method that is both effective and efficient. Our key observation is that the internal representation of the final token in a prompt, extracted from a specific layer of the LLM (which we term the injection-critical layer), captures distinguishing features between clean and contaminated prompts. Leveraging this insight, we train a simple linear classifier on these internal representations using a labeled set of clean and contaminated prompts. We compare PIShield against 11 baselines across 5 diverse benchmark datasets and 8 prompt injection attacks. The results demonstrate that PIShield is both highly effective and efficient, substantially outperforming existing methods. Additionally, we show that PIShield resists strong adaptive attacks.
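The pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: synthetic Gaussian vectors stand in for the final-token hidden states at the injection-critical layer (in practice these would be extracted from a real LLM), and the "lightweight linear classifier" is plain logistic regression trained by gradient descent; all names and dimensions are assumptions.

```python
# Hypothetical sketch of the PIShield-style pipeline. Synthetic vectors stand
# in for final-token hidden states at the injection-critical layer; in the
# real method these come from a specific layer of an LLM.
import math
import random

random.seed(0)
DIM = 16  # assumed representation dimensionality (real LLMs use e.g. 4096)

def fake_representation(contaminated: bool) -> list:
    # Stand-in for the final-token activation at the injection-critical layer.
    # Contaminated prompts get a shifted mean so the two classes separate,
    # mimicking the "statistically separable distributions" the paper reports.
    shift = 1.5 if contaminated else 0.0
    return [random.gauss(shift, 1.0) for _ in range(DIM)]

def train_linear_classifier(X, y, lr=0.1, epochs=100):
    # Plain logistic regression via gradient descent: the lightweight linear
    # classifier; no fine-tuning of the underlying LLM is involved.
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            g = p - yi                      # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    # 1 = contaminated prompt, 0 = clean prompt.
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0

# Labeled training set of clean (0) and contaminated (1) representations.
X = [fake_representation(bool(i % 2)) for i in range(200)]
y = [i % 2 for i in range(200)]
w, b = train_linear_classifier(X, y)
acc = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)
```

At inference time, detecting an injection costs one forward pass up to the chosen layer plus a single dot product, which is why the overhead is negligible relative to running the LLM itself.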