When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

📅 2026-05-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically investigates Tool Description Poisoning (TDP) attacks against tool-augmented large language model (LLM) agents, wherein adversaries inject malicious instructions into tool documentation to manipulate cognitive planning. To this end, we introduce MCP-TDP, the first high-fidelity security benchmark based on the Model Context Protocol (MCP), comprising 32 realistic test cases spanning six risk categories. Our evaluation of eight mainstream LLMs reveals severe vulnerabilities—e.g., GPT-4o exhibits nearly 100% attack success—and demonstrates that common prompt-based defenses are not only ineffective but can exacerbate risks, a phenomenon we term the “firewall fallacy.” We further propose a reactive self-correction defense mechanism that significantly enhances agent robustness against TDP attacks.
📝 Abstract
The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open-domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent's cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool's executable code, but rather covertly injected into its descriptive metadata, the very "manual" an agent relies on for secure planning and decision-making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP-TDP Security Benchmark. This high-fidelity sandbox environment comprises 32 realistic, real-world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT-4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Furthermore, our findings demonstrate that common prompt-guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the "Firewall Fallacy"). Crucially, we also propose a defense mechanism: "Reactive Self-Correction," where an agent autonomously detects and reverts its own malicious actions post-execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.
Problem

Research questions and friction points this paper is trying to address.

Tool Description Poisoning
LLM Agents
Model Context Protocol
Security Benchmark
Cognitive Planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool Description Poisoning
MCP-TDP Security Benchmark
Reactive Self-Correction
Firewall Fallacy
LLM Agent Security
🔎 Similar Papers
No similar papers found.
💼 Related Jobs
S
Shi Liu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
X
Xuehai Tang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
X
Xikang Yang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Liang Lin
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AICausal Inference and LearningMultimodal Data Analysis
B
Biyu Zhou
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
W
Wenjie Xiao
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
W
Wantao Liu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China