DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture

📅 2025-11-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to prompt injection attacks in instruction-following tasks due to their limited semantic role discrimination capability—specifically, their inability to reliably distinguish instructional intent from descriptive content. To address this, we propose a semantics-aware instruction-data separation framework. Our method introduces: (1) a token-level de-instructionalization mechanism that explicitly identifies and removes spurious instructional tokens from inputs; and (2) a residual fusion pathway jointly optimized with role-aware supervision to refine the representation space for robust role disentanglement. The approach is lightweight and plug-and-play. Evaluated on LLaMA-8B and Mistral-7B, it improves instruction-role separation accuracy by 49%, reduces success rates of adaptive prompt injection attacks by 66%, and preserves original task performance without degradation.

📝 Abstract
Large language models (LLMs) have demonstrated impressive instruction-following capabilities. However, these capabilities also expose models to prompt injection attacks, where maliciously crafted inputs overwrite or distract from the intended instructions. A core vulnerability lies in the model's lack of semantic role understanding: it cannot distinguish directive intent from descriptive content, leading it to execute instruction-like phrases embedded in data. We propose DRIP, a training-time defense grounded in a semantic modeling perspective, which enforces robust separation between instruction and data semantics without sacrificing utility. DRIP introduces two lightweight yet complementary mechanisms: (1) a token-wise de-instruction shift that performs semantic disentanglement, weakening directive semantics in data tokens while preserving content meaning; and (2) a residual fusion pathway that provides a persistent semantic anchor, reinforcing the influence of the true top-level instruction during generation. Experimental results on LLaMA-8B and Mistral-7B across three prompt injection benchmarks (SEP, AlpacaFarm, and InjecAgent) demonstrate that DRIP outperforms state-of-the-art defenses, including StruQ, SecAlign, ISE, and PFT, improving role separation by 49%, and reducing attack success rate by 66% for adaptive attacks. Meanwhile, DRIP's utility is on par with the undefended model across AlpacaEval, IFEval, and MT-Bench. Our findings underscore the power of lightweight representation edits and role-aware supervision in securing LLMs against adaptive prompt injection.
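The abstract describes the first mechanism as a token-wise shift that weakens directive semantics in data tokens while preserving content. The paper's actual formulation is not reproduced here; below is a minimal numpy sketch of one plausible reading, in which data-segment token embeddings have their component along a learned "instruction direction" subtracted. The names `instr_direction`, `is_data_token`, and the scaling factor `alpha` are illustrative assumptions, not DRIP's actual parameters.

```python
import numpy as np

def de_instruction_shift(embeddings, is_data_token, instr_direction, alpha=1.0):
    """Sketch of a token-wise de-instruction shift (assumed form).

    For tokens marked as data, subtract the embedding's component along a
    hypothetical learned 'instruction direction', weakening directive
    semantics while leaving the orthogonal (content) part intact.
    Instruction tokens pass through unchanged.
    """
    d = instr_direction / np.linalg.norm(instr_direction)  # unit direction
    proj = embeddings @ d                                  # (seq_len,) directive component per token
    shift = alpha * proj[:, None] * d[None, :]             # component to remove, along d
    mask = is_data_token[:, None].astype(embeddings.dtype) # 1.0 for data tokens, 0.0 otherwise
    return embeddings - mask * shift
```

With `alpha=1.0`, data tokens end up orthogonal to the instruction direction, i.e. their residual directive signal along that axis is zero, while instruction tokens are untouched.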
Problem

Research questions and friction points this paper is trying to address.

Defends LLMs against malicious prompt injection attacks
Separates instruction semantics from descriptive content
Maintains model utility while improving role separation
Innovation

Methods, ideas, or system contributions that make the work stand out.

De-instruction training disentangles directive semantics from data
Residual fusion pathway reinforces top-level instruction influence
Lightweight representation edits secure models without utility loss
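The second mechanism, the residual fusion pathway, is described as a persistent semantic anchor that reinforces the true top-level instruction during generation. Again as an assumed sketch rather than the paper's implementation: pool the instruction tokens' hidden states, project them through a learned map, and add the result as a residual to every position. The pooled-mean choice, the projection `W_fuse`, and the gain `beta` are all hypothetical stand-ins.

```python
import numpy as np

def residual_fusion(hidden_states, instruction_hidden, W_fuse, beta=0.1):
    """Sketch of a residual fusion pathway (assumed form).

    Pools the top-level instruction's hidden states into a single anchor
    vector, projects it through a hypothetical learned matrix W_fuse, and
    re-injects it additively at every sequence position so the original
    instruction keeps influencing generation.
    """
    anchor = instruction_hidden.mean(axis=0)      # pool instruction tokens -> (d,)
    fused = anchor @ W_fuse                       # project anchor into the hidden space
    return hidden_states + beta * fused[None, :]  # broadcast residual to all positions
```

Setting `beta=0.0` recovers the undefended forward pass, which is consistent with the claim that the edit is lightweight and plug-and-play.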