🤖 AI Summary
This work addresses the security vulnerability of tool-augmented agents, which are susceptible to malicious inputs that degrade their efficiency without compromising output correctness. The authors propose Sponge Tool Attack (STA), a novel "denial-of-efficiency" attack that requires only access to user queries and does not necessitate modifications to either the model or the tools. STA exploits semantically faithful prompt rewriting to induce agents into generating unnecessarily verbose yet correct reasoning trajectories, thereby covertly exhausting computational resources. Built upon an iterative multi-agent collaboration framework with explicit rewriting strategies, STA demonstrates strong effectiveness across six models, twelve tools, four agent frameworks, and thirteen cross-domain datasets. The attack preserves answer accuracy while significantly inflating reasoning overhead, revealing for the first time a previously overlooked, stealthy attack surface inherent in tool-calling mechanisms, with notable generalizability and concealment.
📝 Abstract
Enabling large language models (LLMs) to solve complex reasoning tasks is a key step toward artificial general intelligence. Recent work augments LLMs with external tools to enable agentic reasoning, achieving high utility and efficiency in a plug-and-play manner. However, the inherent vulnerabilities of such methods to malicious manipulation of the tool-calling process remain largely unexplored. In this work, we identify a tool-specific attack surface and propose Sponge Tool Attack (STA), which disrupts agentic reasoning solely by rewriting the input prompt under a strict query-only access assumption. Without any modification on the underlying model or the external tools, STA converts originally concise and efficient reasoning trajectories into unnecessarily verbose and convoluted ones before arriving at the final answer. This results in substantial computational overhead while remaining stealthy by preserving the original task semantics and user intent. To achieve this, we design STA as an iterative, multi-agent collaborative framework with explicit rewritten policy control, and generates benign-looking prompt rewrites from the original one with high semantic fidelity. Extensive experiments across 6 models (including both open-source models and closed-source APIs), 12 tools, 4 agentic frameworks, and 13 datasets spanning 5 domains validate the effectiveness of STA.