🤖 AI Summary
Existing benchmarks inadequately evaluate large language models' (LLMs) capabilities in realistic, complex tasks—particularly tool retrieval, multi-hop planning, cross-tool coordination, and precise parameter control. Method: We introduce MCP-Bench, the first multi-step task benchmark grounded in the Model Context Protocol (MCP), integrating 28 live MCP servers and 250 tools across finance, travel, scientific computing, and academic search. Tasks eschew explicit tool hints, requiring models to autonomously retrieve tools, plan multi-step execution trajectories, and orchestrate cross-domain workflows from ambiguous instructions. We propose a three-tier evaluation framework assessing tool understanding, trajectory planning, and task completion. Results: Evaluations of 20 state-of-the-art LLMs reveal persistent bottlenecks in cross-tool coordination and complex reasoning, exposing critical limitations in current LLM agent capabilities.
📝 Abstract
We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and long-horizon planning and reasoning. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows: capabilities not adequately evaluated by existing benchmarks, which rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
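The three-tier evaluation framework can be pictured as a scoring function over an agent's execution trace. The Python sketch below is purely illustrative: the `Step` structure, the reference-tool matching heuristic, and the score definitions are assumptions for exposition, not MCP-Bench's actual implementation.

```python
# Illustrative sketch of a three-tier agent evaluation
# (tool-level, trajectory-level, task-level). All names and
# scoring rules here are assumptions, not MCP-Bench's real code.
from dataclasses import dataclass


@dataclass
class Step:
    tool: str            # name of the tool the agent invoked
    params_valid: bool   # did the call satisfy the tool's schema?


def evaluate(trajectory: list[Step],
             reference_tools: list[str],
             task_completed: bool) -> dict[str, float]:
    """Score one agent run on three illustrative axes."""
    # Tier 1: tool-level schema understanding
    # (fraction of calls whose parameters were schema-valid)
    tool_score = sum(s.params_valid for s in trajectory) / max(len(trajectory), 1)

    # Tier 2: trajectory-level planning
    # (did the run cover the tools a reference solution uses?)
    used = {s.tool for s in trajectory}
    plan_score = len(used & set(reference_tools)) / max(len(reference_tools), 1)

    # Tier 3: end-to-end task completion
    completion_score = 1.0 if task_completed else 0.0

    return {"tool": tool_score, "planning": plan_score, "completion": completion_score}
```

For example, a run that calls both reference tools but passes malformed parameters to one would score 0.5 on the tool tier, 1.0 on planning, and 1.0 on completion; aggregating such axes separately is what lets a benchmark distinguish planning failures from schema-usage failures.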