🤖 AI Summary
This work addresses the vulnerability of tool-augmented large language model (LLM) agents to indirect prompt injection, in which adversaries embed malicious instructions in tool outputs to steer the agent. The authors propose ClawGuard, a runtime security framework that requires no modification to the underlying model or infrastructure. It automatically derives task-specific access constraints from the user's stated objective and enforces the resulting user-confirmed rule set at every tool-call boundary, giving a deterministic, auditable interception mechanism. The framework targets three indirect injection pathways: web and local content injection, MCP server injection, and skill file injection. Experiments with five state-of-the-art LLMs on the AgentDojo, SkillInject, and MCPSafeBench benchmarks indicate robust protection against indirect prompt injection without compromising task utility.
📝 Abstract
Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address it, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, replacing unreliable, alignment-dependent defense with a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, ClawGuard blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that ClawGuard achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.
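The idea of enforcing a rule set at the tool-call boundary can be illustrated with a minimal sketch. This is not ClawGuard's implementation or API: the names `Rule`, `Guard`, and `BlockedToolCall`, the glob-based argument matching, and the toy `send_email` tool are all assumptions made for illustration, and the rules are written by hand here as a stand-in for constraints derived from the user's objective and confirmed by the user.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Rule:
    """Hypothetical allow rule: a tool name plus glob patterns for its arguments."""
    tool: str
    arg_patterns: Dict[str, str] = field(default_factory=dict)

    def permits(self, tool: str, args: Dict[str, Any]) -> bool:
        if tool != self.tool:
            return False
        # Every constrained argument must match its pattern (case-sensitive glob).
        return all(
            fnmatchcase(str(args.get(name, "")), pattern)
            for name, pattern in self.arg_patterns.items()
        )


class BlockedToolCall(Exception):
    """Raised when a requested tool call falls outside the rule set."""


class Guard:
    """Hypothetical wrapper: every tool call is checked before it executes."""

    def __init__(self, rules: List[Rule], tools: Dict[str, Callable[..., Any]]):
        self.rules = rules
        self.tools = tools
        self.audit_log: List[Tuple[str, Dict[str, Any], str]] = []

    def call(self, tool: str, **args: Any) -> Any:
        # The decision depends only on the rules, not on model behaviour.
        allowed = any(rule.permits(tool, args) for rule in self.rules)
        self.audit_log.append((tool, args, "allow" if allowed else "block"))
        if not allowed:
            # Blocked before the tool runs, so no side effect is produced.
            raise BlockedToolCall(f"{tool}({args}) is outside the task rule set")
        return self.tools[tool](**args)


# Toy tool with a visible side effect (assumption for the demo).
sent: List[str] = []

def send_email(to: str, body: str) -> str:
    sent.append(to)
    return "sent"

# Rules standing in for constraints derived from a task such as
# "reply to my colleagues at example.com".
guard = Guard(
    rules=[Rule("send_email", {"to": "*@example.com"})],
    tools={"send_email": send_email},
)

print(guard.call("send_email", to="alice@example.com", body="Meeting at 3pm"))
try:
    # A call an injected instruction might induce the agent to make.
    guard.call("send_email", to="attacker@evil.test", body="secrets")
except BlockedToolCall as exc:
    print("blocked:", exc)
```

In this sketch the allow-or-block decision is a pure function of the rule set and the requested call, which is what makes it deterministic, and the `audit_log` list records every decision for later review. A real system would need richer constraints than glob patterns, but the interception point is the same: the check runs before the tool does.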