MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers

📅 2025-08-19

📈 Citations: 0

✨ Influential: 0

career value

234K/year

🤖 AI Summary

This paper identifies “Tool Poisoning” as a novel security threat in the Model Context Protocol (MCP), wherein malicious instructions exploit compromised tool metadata—without requiring tool execution—to compromise real-world MCP servers and autonomous agent ecosystems. Method: The authors formally define this attack paradigm, introduce MCPTox—the first large-scale, realistic benchmark comprising 45 production MCP servers and 353 real tools—and design three attack templates. Leveraging few-shot learning, they generate 1,312 malicious test cases covering ten distinct risk categories. Contribution/Results: Evaluations across 20 state-of-the-art LLM-based agents reveal up to 72.8% attack success rates; existing safety alignment mechanisms fail catastrophically (rejection rate <3%). Critically, the study demonstrates an inverse correlation between model capability and robustness: stronger models exhibit higher vulnerability—a previously unreported phenomenon that challenges foundational assumptions in LLM safety research.

Technology Category

Application Category

📝 Abstract

By providing a standardized interface for LLM agents to interact with external tools, the Model Context Protocol (MCP) is quickly becoming a cornerstone of the modern autonomous agent ecosystem. However, it creates novel attack surfaces due to untrusted external tools. While prior work has focused on attacks injected through external tool outputs, we investigate a more fundamental vulnerability: Tool Poisoning, where malicious instructions are embedded within a tool's metadata without execution. To date, this threat has been primarily demonstrated through isolated cases, lacking a systematic, large-scale evaluation. We introduce MCPTox, the first benchmark to systematically evaluate agent robustness against Tool Poisoning in realistic MCP settings. MCPTox is constructed upon 45 live, real-world MCP servers and 353 authentic tools. To achieve this, we design three distinct attack templates to generate a comprehensive suite of 1312 malicious test cases by few-shot learning, covering 10 categories of potential risks. Our evaluation on 20 prominent LLM agents setting reveals a widespread vulnerability to Tool Poisoning, with o1-mini, achieving an attack success rate of 72.8%. We find that more capable models are often more susceptible, as the attack exploits their superior instruction-following abilities. Finally, the failure case analysis reveals that agents rarely refuse these attacks, with the highest refused rate (Claude-3.7-Sonnet) less than 3%, demonstrating that existing safety alignment is ineffective against malicious actions that use legitimate tools for unauthorized operation. Our findings create a crucial empirical baseline for understanding and mitigating this widespread threat, and we release MCPTox for the development of verifiably safer AI agents. Our dataset is available at an anonymized repository: extit{https://anonymous.4open.science/r/AAAI26-7C02}.

Problem

Research questions and friction points this paper is trying to address.

Evaluating agent vulnerability to tool poisoning attacks

Assessing malicious instructions in tool metadata risks

Systematically testing real-world MCP server security threats

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for tool poisoning attacks on MCP servers

Three attack templates generate malicious test cases

Evaluates agent robustness with real-world tools

🔎 Similar Papers

No similar papers found.