🤖 AI Summary
Prompt injection attacks pose a significant security threat to large language model (LLM) applications, yet existing detection methods suffer from low accuracy or high computational overhead. To address this, we propose a lightweight and efficient detection method based on internal representation analysis. Our key insight is the identification of a critical hidden layer within LLMs that exhibits high sensitivity to prompt injection: the final-token representations at this layer show statistically separable distributions between benign and malicious prompts. Leveraging this property, our method extracts only that layer's final-token representation and trains a lightweight linear classifier, requiring minimal computation and no model fine-tuning. Evaluated across five benchmark datasets and eight prevalent attack types, our approach consistently outperforms eleven state-of-the-art baselines in detection accuracy while maintaining negligible inference overhead. It further demonstrates strong generalization to unseen attack variants and robustness against adaptive adversarial attacks.
📝 Abstract
LLM-integrated applications are vulnerable to prompt injection attacks, where an attacker contaminates the input to inject malicious prompts, causing the LLM to follow the attacker's intent instead of the original user's. Existing prompt injection detection methods often have sub-optimal performance and/or high computational overhead. In this work, we propose PIShield, a detection method that is both effective and efficient. Our key observation is that the internal representation of the final token in a prompt, extracted from a specific layer of the LLM (which we term the injection-critical layer), captures distinguishing features between clean and contaminated prompts. Leveraging this insight, we train a simple linear classifier on these internal representations using a labeled set of clean and contaminated prompts. We compare PIShield against 11 baselines across 5 diverse benchmark datasets and 8 prompt injection attacks. The results demonstrate that PIShield is both highly effective and efficient, substantially outperforming existing methods. Additionally, we show that PIShield resists strong adaptive attacks.
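The pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: synthetic Gaussian vectors stand in for the final-token hidden states at the injection-critical layer (in practice these would be extracted from a real LLM), and the "lightweight linear classifier" is plain logistic regression trained by gradient descent; all names and dimensions are assumptions.

```python
# Hypothetical sketch of the PIShield-style pipeline. Synthetic vectors stand
# in for final-token hidden states at the injection-critical layer; in the
# real method these come from a specific layer of an LLM.
import math
import random

random.seed(0)
DIM = 16  # assumed representation dimensionality (real LLMs use e.g. 4096)

def fake_representation(contaminated: bool) -> list:
    # Stand-in for the final-token activation at the injection-critical layer.
    # Contaminated prompts get a shifted mean so the two classes separate,
    # mimicking the "statistically separable distributions" the paper reports.
    shift = 1.5 if contaminated else 0.0
    return [random.gauss(shift, 1.0) for _ in range(DIM)]

def train_linear_classifier(X, y, lr=0.1, epochs=100):
    # Plain logistic regression via gradient descent: the lightweight linear
    # classifier; no fine-tuning of the underlying LLM is involved.
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            g = p - yi                      # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    # 1 = contaminated prompt, 0 = clean prompt.
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0

# Labeled training set of clean (0) and contaminated (1) representations.
X = [fake_representation(bool(i % 2)) for i in range(200)]
y = [i % 2 for i in range(200)]
w, b = train_linear_classifier(X, y)
acc = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)
```

At inference time, detecting an injection costs one forward pass up to the chosen layer plus a single dot product, which is why the overhead is negligible relative to running the LLM itself.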