When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study systematically investigates Tool Description Poisoning (TDP) attacks against tool-augmented large language model (LLM) agents, wherein adversaries inject malicious instructions into tool documentation to manipulate cognitive planning. To this end, we introduce MCP-TDP, the first high-fidelity security benchmark based on the Model Context Protocol (MCP), comprising 32 realistic test cases spanning six risk categories. Our evaluation of eight mainstream LLMs reveals severe vulnerabilities—e.g., GPT-4o exhibits nearly 100% attack success—and demonstrates that common prompt-based defenses are not only ineffective but can exacerbate risks, a phenomenon we term the “firewall fallacy.” We further propose a reactive self-correction defense mechanism that significantly enhances agent robustness against TDP attacks.

📝 Abstract

The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open-domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent's cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool's executable code, but rather covertly injected into its descriptive metadata, the very "manual" an agent relies on for secure planning and decision-making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP-TDP Security Benchmark. This high-fidelity sandbox environment comprises 32 realistic, real-world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT-4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Furthermore, our findings demonstrate that common prompt-guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the "Firewall Fallacy"). Crucially, we also propose a defense mechanism: "Reactive Self-Correction," where an agent autonomously detects and reverts its own malicious actions post-execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.

Problem

Research questions and friction points this paper is trying to address.

Tool Description Poisoning

LLM Agents

Model Context Protocol

Security Benchmark

Cognitive Planning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool Description Poisoning

MCP-TDP Security Benchmark

Reactive Self-Correction