🤖 AI Summary
Prompt injection attacks pose a critical security threat to tool-augmented large language model (LLM) agents operating in untrusted environments. To address this, we propose SIC, a novel iterative soft instruction cleansing mechanism. SIC detects and rectifies missed malicious instructions through multiple rounds of detection and semantic rewriting, overcoming the limitations of single-shot purification. It introduces three key components: (1) instruction residue detection to flag adversarial content that survives rewriting, (2) a maximum iteration bound to keep the cost of sanitization controlled, and (3) a safety interruption policy that halts execution when threats persist, balancing robustness and controllability. Experiments show that SIC substantially outperforms existing single-pass methods: even under worst-case conditions with strong adversaries, attack success rates fall to 15%, significantly raising the bar for successful exploitation. This work offers a lightweight, practical defense paradigm for securing LLM agents in open, real-world environments.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when handling untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control), a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, it is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction-like content still remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible: a strong adversary can still achieve a 15% attack success rate (ASR) by embedding non-imperative workflows. It nonetheless raises the bar significantly.
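The detect-rewrite-recheck loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `detect_instructions` and `rewrite_instructions` are hypothetical stand-ins (here, toy regex heuristics) for the LLM-based detector and rewriter that SIC would actually use, and the default iteration bound is an assumption.

```python
import re

# Toy patterns standing in for an LLM-based instruction detector.
IMPERATIVE_PATTERNS = [
    r"\bignore (all|previous) instructions\b",
    r"\byou must\b",
    r"\bdo the following\b",
]

def detect_instructions(text: str) -> list[str]:
    """Return imperative instruction-like spans found in `text`."""
    hits = []
    for pat in IMPERATIVE_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pat, text, re.IGNORECASE))
    return hits

def rewrite_instructions(text: str, hits: list[str]) -> str:
    """Mask each detected span (a real rewriter could paraphrase or drop it)."""
    for h in hits:
        text = text.replace(h, "[REDACTED]")
    return text

def sanitize(data: str, max_iters: int = 3) -> str:
    """Iteratively cleanse `data`; halt the agent if instructions persist."""
    for _ in range(max_iters):
        hits = detect_instructions(data)
        if not hits:
            return data  # input is clean: safe to hand to the agent
        data = rewrite_instructions(data, hits)
    # Iteration bound reached: re-check before deciding to halt.
    if detect_instructions(data):
        raise RuntimeError("instruction-like content persists; halting agent")
    return data
```

The key design point the loop captures is that a single rewrite pass is not trusted: every rewrite is followed by another detection round, and the safety interruption fires only when the iteration budget is exhausted with detections still firing.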