MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately assess language agents’ authentic tool-interaction capabilities under the Model Context Protocol (MCP) paradigm, leading to inaccurate performance estimation and poor capability discrimination. To address this, we propose MCP-AgentBench, the first systematic evaluation benchmark and methodology designed specifically for MCP. It comprises a rigorously constructed test suite of 600 complex queries spanning six real-world task categories; a standardized MCP testbed integrating 33 operational servers and 188 distinct tools; and MCP-Eval, an outcome-oriented, fully automated evaluation methodology supporting multi-turn, end-to-end tool-invocation assessment. Empirical evaluation of state-of-the-art agents reveals substantial, previously undetected performance disparities in realistic MCP environments. MCP-AgentBench establishes a reproducible, extensible, and standardized evaluation infrastructure for advancing research on language-agent tool orchestration.
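
The evaluation framework described above is outcome-oriented: an agent is scored on whether the end state of a multi-turn, tool-using episode satisfies the query, not on whether it reproduces a reference tool trajectory. The sketch below illustrates one way such a loop could be structured. It is not code from the paper; the Episode and ToolCall types, the agent_step, call_tool, and judge callables, and the 10-turn cap are hypothetical choices made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, Union

# Hypothetical types and helpers, illustrative only; not part of the MCP-AgentBench release.

@dataclass
class ToolCall:
    server: str      # which MCP server to route the call to
    tool: str        # tool name exposed by that server
    arguments: dict  # JSON-serializable tool arguments

@dataclass
class Episode:
    query: str                                                # one benchmark query
    transcript: List[Tuple[ToolCall, str]] = field(default_factory=list)
    final_answer: str = ""

def run_episode(
    agent_step: Callable[[Episode], Union[ToolCall, str]],   # agent policy: next tool call or final answer
    call_tool: Callable[[ToolCall], str],                     # bridge into the MCP testbed
    query: str,
    max_turns: int = 10,
) -> Episode:
    """Run one multi-turn, end-to-end interaction for a single query."""
    episode = Episode(query=query)
    for _ in range(max_turns):
        action = agent_step(episode)
        if isinstance(action, ToolCall):
            result = call_tool(action)                        # execute the tool, feed the result back
            episode.transcript.append((action, result))
        else:
            episode.final_answer = action                     # the agent decided it is finished
            break
    return episode

def evaluate(
    agent_step: Callable[[Episode], Union[ToolCall, str]],
    call_tool: Callable[[ToolCall], str],
    judge: Callable[[Episode], bool],                         # outcome check on the final episode state
    queries: List[str],
) -> float:
    """Outcome-oriented scoring: only final task success counts, not the exact tool trajectory."""
    episodes = [run_episode(agent_step, call_tool, q) for q in queries]
    return sum(judge(e) for e in episodes) / len(episodes)
```

Judging only the final episode state leaves agents free to reach the goal through different tool sequences, which is the property the abstract describes as prioritizing real-world task success.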

📝 Abstract
The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard, designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely utilitarian agentic AI. However, despite MCP's growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, leading to a distorted perception of their true operational value and an inability to reliably differentiate proficiencies. To bridge this critical evaluation gap, we introduce MCP-AgentBench -- a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology prioritizing real-world task success. Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP's transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems.
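
For readers unfamiliar with what "MCP-mediated tool interaction" looks like in practice, the sketch below connects to a single MCP server over stdio, discovers its tools, and invokes one. It assumes the official MCP Python SDK (the `mcp` package) and follows its documented client pattern; the weather_server.py command, the get_forecast tool, and its arguments are hypothetical and do not correspond to any of the benchmark's 33 servers or 188 tools.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical server command; the benchmark's actual servers are not named here.
server_params = StdioServerParameters(command="python", args=["weather_server.py"])

async def main() -> None:
    # Launch the server as a subprocess and open an MCP client session over stdio.
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tools this server exposes (name, description, input schema).
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Invoke one tool; "get_forecast" and its arguments are made up for illustration.
            result = await session.call_tool("get_forecast", arguments={"city": "Berlin"})
            print(result.content)

asyncio.run(main())
```

Because every MCP server exposes the same list/call interface, an agent harness written against this protocol can be pointed at different servers without per-tool glue code, which is what makes a standardized multi-server testbed practical.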
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks fail to capture real-world language agent performance with MCP tools
No reliable way to assess and differentiate agent capabilities in MCP-mediated tool interactions
Lack of a standardized framework for building and validating interoperable, tool-using AI systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

MCP testbed with 33 servers and 188 tools
600 queries across 6 categories
Outcome-oriented MCP-Eval methodology