CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
As LLM agents gain broader tool and data access, indirect prompt injection attacks pose increasingly severe security risks, and existing defenses suffer from high false-positive rates due to the context-dependent nature of attacks, limiting practical deployment. Method: We propose a fine-grained, token-level prompt sanitization framework grounded in the classic security principle that data should not contain executable instructions. An instruction-data separation mechanism strips adversarial instructions from tool outputs in a non-blocking, context-independent way, without requiring attack-specific training data and while preserving agent functionality. The approach combines a token classifier trained on instruction-tuning data with instruction detection and stripping modules deployed at tool output interfaces. Contribution/Results: The method generalizes across diverse injection variants and benchmarks; on AgentDojo it reduces attack success rate from 34% to 3%, part of a 7-10x ASR reduction across benchmarks, without impairing agent utility in benign or malicious settings.
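A minimal sketch of the token-level sanitization step, assuming a hypothetical `score_fn` that stands in for the paper's trained token classifier; whitespace tokenization is a simplification of the model-level tokens the actual system scores:

```python
from typing import Callable, List, Tuple

def sanitize_tool_output(
    text: str,
    score_fn: Callable[[List[str]], List[float]],
    threshold: float = 0.5,
) -> Tuple[str, str]:
    """Strip instruction-like tokens from a tool output before it is added
    to the agent context. Returns (sanitized_text, removed_text)."""
    tokens = text.split()  # simplification: the real system scores model tokens
    scores = score_fn(tokens)
    kept = [t for t, s in zip(tokens, scores) if s < threshold]
    removed = [t for t, s in zip(tokens, scores) if s >= threshold]
    return " ".join(kept), " ".join(removed)
```

Because only the flagged tokens are removed, the surrounding data survives intact and nothing is blocked outright, which is what keeps false positives from eroding agent utility.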

📝 Abstract
The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.
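To make the "removes instructions from tool outputs" point concrete, here is a hedged sketch of where such a sanitizer could sit in an agent loop; the function names and message schema are illustrative assumptions, not the paper's API:

```python
from typing import Callable, Dict, List, Tuple

def run_tool_and_sanitize(
    messages: List[Dict[str, str]],
    tool_fn: Callable[[], str],                  # the tool call, bound to its arguments
    sanitize: Callable[[str], Tuple[str, str]],  # e.g. a sanitize_tool_output function
) -> List[Dict[str, str]]:
    """Execute a tool, sanitize its raw output, and only then append it to the
    agent's message history, so injected instructions never reach the LLM."""
    raw = tool_fn()                     # untrusted data from the outside world
    sanitized, removed = sanitize(raw)  # strip AI-directed instructions
    if removed:
        # Non-blocking: the stripped span is only recorded as a suspected
        # injection; the agent keeps running on the sanitized data.
        print(f"[sanitizer] removed suspected injection: {removed!r}")
    messages.append({"role": "tool", "content": sanitized})
    return messages
```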
Problem

Research questions and friction points this paper is trying to address.

Securing AI agents against indirect prompt injection attacks
Reducing false positives in malicious instruction detection
Sanitizing tool outputs by removing executable AI instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level sanitization removes AI-directed instructions from outputs
Non-blocking approach requires no calibration or context awareness
Trains predictors using only standard instruction-tuning data (see the labeling sketch below)
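As a rough illustration of how token-level labels could be derived from instruction-tuning data alone, assuming Alpaca-style `instruction`/`input` fields (a sketch under stated assumptions, not the paper's exact recipe):

```python
from typing import Dict, List, Tuple

def build_token_labels(example: Dict[str, str]) -> Tuple[List[str], List[int]]:
    """Label tokens from the instruction field as 1 (instruction) and tokens
    from the accompanying input field as 0 (data). Whitespace splitting stands
    in for the model tokenizer."""
    instr_tokens = example["instruction"].split()
    data_tokens = example["input"].split()
    tokens = instr_tokens + data_tokens
    labels = [1] * len(instr_tokens) + [0] * len(data_tokens)
    return tokens, labels

# Hypothetical record, not taken from the paper's training set:
tokens, labels = build_token_labels({
    "instruction": "Summarize the following article in one sentence.",
    "input": "The city council approved the new budget on Tuesday.",
})
# tokens[:3] -> ['Summarize', 'the', 'following'], labels[:3] -> [1, 1, 1]
```

A token classifier trained on such labels can then be applied to tool outputs, where any token predicted as instruction-like becomes a candidate for removal under the principle that data should not contain executable instructions.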