🤖 AI Summary
AI agents remain vulnerable to indirect prompt injection attacks, yet existing benchmarks suffer from flawed evaluation metrics, implementation errors, and insufficient attack strength. Method: The paper proposes a modular, model-agnostic dual-firewall defense framework: a Tool-Input Firewall (Minimizer) that minimizes and sanitizes agent-generated tool inputs, and a Tool-Output Firewall (Sanitizer) that purifies tool responses, jointly disrupting the propagation of malicious instructions at the tool interaction layer. Contribution/Results: Through rigorous adversarial analysis and benchmark remediation, the authors systematically identify and rectify critical evaluation flaws across multiple mainstream benchmarks. Experiments demonstrate that the approach reduces attack success rates to near zero on four major public benchmarks while preserving high task completion, achieving a state-of-the-art trade-off between security and utility. This work establishes a new paradigm for robust agent security evaluation and defense.
📝 Abstract
AI agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cause unintended or harmful behavior. Inspired by the well-established concept of firewalls, we show that a simple, modular, and model-agnostic defense operating at the agent-tool interface achieves perfect security (0% or the lowest possible attack success rate) with high utility (task success rate) across four public benchmarks: AgentDojo, Agent Security Bench, InjecAgent, and tau-Bench, achieving a state-of-the-art security-utility trade-off compared to prior results. Specifically, we employ a defense based on two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). Unlike prior, more complex approaches, this firewall defense makes minimal assumptions about the agent and can be deployed out of the box without compromising utility. However, our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics, implementation bugs, and, most importantly, weak attacks, which hinder meaningful progress in the field. To foster such progress, we present targeted fixes for these issues in AgentDojo and Agent Security Bench and propose best practices for more robust benchmark design. Further, we demonstrate that although these firewalls push the state of the art on existing benchmarks, they can still be bypassed in practice, underscoring the need to incorporate stronger attacks in security benchmarks. Overall, our work shows that existing agentic security benchmarks are easily saturated by a simple approach, highlighting the need for stronger agentic security benchmarks with carefully chosen evaluation metrics and strong adaptive attacks.
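The dual-firewall idea can be sketched as a thin wrapper around every tool call: the Minimizer forwards only the input fields a tool legitimately needs, and the Sanitizer scrubs the tool's response before it reaches the agent. The sketch below is an illustrative assumption, not the paper's implementation; in particular, the function names, the per-tool allow-list, and the regex-based sanitization heuristic are placeholders for whatever minimization and purification logic the actual firewalls use.

```python
# Hypothetical sketch of the dual-firewall pattern at the agent-tool interface.
# The allow-list and regex patterns below are illustrative stand-ins, not the
# paper's actual Minimizer/Sanitizer logic.
import re
from typing import Callable

def tool_input_firewall(tool_name: str, args: dict, allowed_fields: dict) -> dict:
    """Minimizer: forward only the fields this tool is allowed to receive."""
    allowed = allowed_fields.get(tool_name, set())
    return {k: v for k, v in args.items() if k in allowed}

# Toy instruction-like patterns; a real Sanitizer would use a stronger detector.
INJECTION_PATTERNS = [
    r"(?i)ignore (all )?(previous|prior) instructions",
    r"(?i)you must now",
]

def tool_output_firewall(output: str) -> str:
    """Sanitizer: purify the tool response before it reaches the agent."""
    for pattern in INJECTION_PATTERNS:
        output = re.sub(pattern, "[REMOVED]", output)
    return output

def call_tool_safely(tool: Callable[..., str], tool_name: str,
                     args: dict, allowed_fields: dict) -> str:
    """Wrap one tool call with both firewalls."""
    safe_args = tool_input_firewall(tool_name, args, allowed_fields)
    raw_output = tool(**safe_args)
    return tool_output_firewall(raw_output)
```

The point of the sketch is the interception layer, not the filtering heuristic: because both firewalls sit between agent and tool, the defense needs no access to the agent's internals, which is what makes it model-agnostic and deployable out of the box.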