🤖 AI Summary
Existing benchmarks inadequately evaluate large language models' (LLMs) capabilities in realistic, complex tasks—particularly tool retrieval, multi-hop planning, cross-tool coordination, and precise parameter control. Method: We introduce MCP-Bench, the first multi-step task benchmark grounded in the Model Context Protocol (MCP), integrating 28 live MCP servers and 250 tools across finance, travel, scientific computing, and academic search. Tasks eschew explicit tool hints, requiring models to autonomously retrieve tools, plan multi-step execution trajectories, and orchestrate cross-domain workflows from ambiguous instructions. We propose a three-tier evaluation framework assessing tool understanding, trajectory planning, and task completion. Results: Evaluations of 20 state-of-the-art LLMs reveal persistent bottlenecks in cross-tool coordination and complex reasoning, exposing critical limitations in current LLM agent capabilities.
📝 Abstract
We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and long-horizon planning and reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows: capabilities not adequately evaluated by existing benchmarks, which rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
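The three-tier evaluation framework can be pictured as a scoring function over an agent's execution trace. The Python sketch below is purely illustrative: the `Step` structure, the reference-tool matching heuristic, and the score definitions are assumptions for exposition, not MCP-Bench's actual implementation.

```python
# Illustrative sketch of a three-tier agent evaluation
# (tool-level, trajectory-level, task-level). All names and
# scoring rules here are assumptions, not MCP-Bench's real code.
from dataclasses import dataclass


@dataclass
class Step:
    tool: str            # name of the tool the agent invoked
    params_valid: bool   # did the call satisfy the tool's schema?


def evaluate(trajectory: list[Step],
             reference_tools: list[str],
             task_completed: bool) -> dict[str, float]:
    """Score one agent run on three illustrative axes."""
    # Tier 1: tool-level schema understanding
    # (fraction of calls whose parameters were schema-valid)
    tool_score = sum(s.params_valid for s in trajectory) / max(len(trajectory), 1)

    # Tier 2: trajectory-level planning
    # (did the run cover the tools a reference solution uses?)
    used = {s.tool for s in trajectory}
    plan_score = len(used & set(reference_tools)) / max(len(reference_tools), 1)

    # Tier 3: end-to-end task completion
    completion_score = 1.0 if task_completed else 0.0

    return {"tool": tool_score, "planning": plan_score, "completion": completion_score}
```

For example, a run that calls both reference tools but passes malformed parameters to one would score 0.5 on the tool tier, 1.0 on planning, and 1.0 on completion; aggregating such axes separately is what lets a benchmark distinguish planning failures from schema-usage failures.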