MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

📅 2025-08-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately evaluate large language models' (LLMs) capabilities on realistic, complex tasks, particularly tool retrieval, multi-hop planning, cross-tool coordination, and precise parameter control. Method: We introduce the first multi-step task benchmark grounded in the Model Context Protocol (MCP), integrating 28 live MCP servers and 250 tools across finance, travel, scientific computing, and academic search. Tasks omit explicit tool hints, requiring models to autonomously retrieve tools, plan multi-step execution trajectories, and orchestrate cross-domain workflows from ambiguous instructions. We propose a three-tier evaluation framework assessing tool understanding, trajectory planning, and task completion. Results: Evaluations of 20 state-of-the-art LLMs reveal persistent bottlenecks in cross-tool coordination and complex reasoning, exposing critical limitations in current LLM agent capabilities.

📝 Abstract
We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning and reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows, capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges on MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
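The multi-hop agent loop the abstract describes (retrieve a tool from fuzzy instructions, invoke it, ground the next step in the intermediate output) can be sketched roughly as follows. This is a hypothetical stand-in for illustration, not MCP-Bench's actual harness or any real MCP SDK: `Tool`, `run_task`, and the `plan` callback are all assumed names.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str  # the schema text the agent must understand

    def call(self, **params):
        # Stand-in for a live MCP tool invocation; real servers return rich JSON.
        return {"tool": self.name, "params": params}

def run_task(task, tools, plan):
    """Drive a multi-hop trajectory.

    `plan(task, tools, history)` is the model's decision function: it returns
    a (tool_name, params) pair for the next hop, or None to stop. No tool
    names appear in `task` itself, so tool retrieval is part of planning.
    """
    history, steps = [], []
    while (decision := plan(task, tools, history)) is not None:
        name, params = decision
        tool = next(t for t in tools if t.name == name)
        output = tool.call(**params)   # intermediate tool output...
        history.append(output)         # ...grounds the next planning step
        steps.append((name, params, output))
    return steps
```

In the benchmark itself the `plan` role is played by the LLM under test, and the tools are live MCP servers rather than local stubs.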
Problem

Research questions and friction points this paper is trying to address.

Benchmarking LLM agents on realistic multi-step tool-using tasks
Evaluating cross-tool coordination and complex workflow orchestration capabilities
Testing tool retrieval from fuzzy instructions without explicit tool names
Innovation

Methods, ideas, or system contributions that make the work stand out.

MCP-Bench benchmark with 28 live servers
250 cross-domain tools for complex workflows
Multi-faceted evaluation framework for tool-using agents
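The three evaluation tiers named above (tool-level schema understanding and usage, trajectory-level planning, task completion) could be aggregated along these lines. The tier names come from the paper; the [0, 1] scale and equal weights are assumptions for illustration, not the paper's rubric.

```python
def mcp_bench_score(tool_use: float, trajectory: float, completion: float,
                    weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Combine the three tier scores (each assumed to lie in [0, 1]) into a
    single weighted score. Equal weights are an illustrative default."""
    scores = (tool_use, trajectory, completion)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("tier scores must be in [0, 1]")
    return sum(w * s for w, s in zip(weights, scores))
```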
Authors

Zhenting Wang (Accenture; Rutgers University)
Qi Chang (Center for Advanced AI, Accenture)
Hemani Patel (Center for Advanced AI, Accenture; UC Berkeley)
Shashank Biju (Center for Advanced AI, Accenture; UC Berkeley)
Cheng-En Wu (University of Wisconsin-Madison)
Quan Liu (Center for Advanced AI, Accenture)
Aolin Ding (Security Research Scientist, Accenture)
Alireza Rezazadeh (University of Minnesota)
Ankit Shah (Center for Advanced AI, Accenture)
Yujia Bao (Massachusetts Institute of Technology)
Eugene Siow (Center for Advanced AI, Accenture)