🤖 AI Summary
This study systematically investigates Tool Description Poisoning (TDP) attacks against tool-augmented large language model (LLM) agents, wherein adversaries inject malicious instructions into tool documentation to manipulate cognitive planning. To this end, we introduce MCP-TDP, the first high-fidelity security benchmark based on the Model Context Protocol (MCP), comprising 32 realistic test cases spanning six risk categories. Our evaluation of eight mainstream LLMs reveals severe vulnerabilities—e.g., GPT-4o exhibits nearly 100% attack success—and demonstrates that common prompt-based defenses are not only ineffective but can exacerbate risks, a phenomenon we term the “firewall fallacy.” We further propose a reactive self-correction defense mechanism that significantly enhances agent robustness against TDP attacks.
📝 Abstract
The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open-domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent's cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool's executable code, but rather covertly injected into its descriptive metadata, the very "manual" an agent relies on for secure planning and decision-making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP-TDP Security Benchmark. This high-fidelity sandbox environment comprises 32 realistic, real-world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT-4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Furthermore, our findings demonstrate that common prompt-guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the "Firewall Fallacy"). Crucially, we also propose a defense mechanism: "Reactive Self-Correction," where an agent autonomously detects and reverts its own malicious actions post-execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.