MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark

📅 2025-08-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating MCP (Model Context Protocol) tool use faces four key challenges: the absence of multi-domain benchmarks, heterogeneous response formats, volatile call success rates, and context-window constraints that limit the number of tools available per invocation. Method: The authors introduce MCPToolBench++, a large-scale, multi-domain benchmark for AI agent tool use, comprising 40+ categories and 4,000+ real-world MCP servers collected from MCP marketplaces and GitHub communities, with curated single-step and multi-step invocation datasets. They propose a standardized evaluation protocol that unifies response-format handling, accounts for success-rate variability, and mitigates context-length limitations. Contribution/Results: They conduct systematic evaluations across multiple state-of-the-art LLMs with agentic abilities, delivering the first cross-domain, reproducible benchmark of MCP tool-calling capability. MCPToolBench++ establishes foundational infrastructure for advancing LLM-based tool utilization research.

📝 Abstract
LLMs' capabilities are enhanced by using function calls to integrate various data sources or API results into the context window. Typical tools include search, web crawlers, maps, financial data, file systems, and browser use. Integrating these data sources or functions requires a standardized method. The Model Context Protocol (MCP) provides a standardized way to supply context to LLMs. However, the evaluation of LLMs' and AI agents' MCP tool use abilities suffers from several issues. First, there is a lack of comprehensive datasets or benchmarks for evaluating various MCP tools. Second, the diverse response formats of MCP tool call executions further increase the difficulty of evaluation. Additionally, unlike existing tool-use benchmarks with high success rates on programming and math functions, the success rates of real-world MCP tools are not guaranteed and vary across different MCP servers. Furthermore, the LLM's context window limits the number of tools available in a single run, because the textual descriptions of tools and their parameters are too long for an LLM to process all at once. To help address the challenges of evaluating LLMs' performance on calling MCP tools, we propose MCPToolBench++, a large-scale, multi-domain AI agent tool use benchmark. As of July 2025, this benchmark is built upon a marketplace of over 4k MCP servers from more than 40 categories, collected from MCP marketplaces and GitHub communities. The datasets consist of both single-step and multi-step tool calls across different categories. We evaluated SOTA LLMs with agentic abilities on this benchmark and report the results.
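To make the single-step setting concrete, here is a minimal sketch of how a benchmark item might validate a model-emitted tool call against an MCP-style tool descriptor. The `get_weather` tool and the `validate_call` helper are hypothetical illustrations, not part of the paper; the descriptor fields (`name`, `description`, `inputSchema`) follow the MCP tool schema.

```python
# Hypothetical MCP-style tool descriptor (field names follow the MCP
# tool schema: name, description, inputSchema).
weather_tool = {
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def validate_call(tool: dict, call: dict) -> bool:
    """Schema-level check on a model-emitted tool call: correct tool
    name and all required parameters present. This is the kind of
    verification a single-step benchmark item needs before executing
    the call against a real server."""
    if call.get("name") != tool["name"]:
        return False
    required = tool["inputSchema"].get("required", [])
    args = call.get("arguments", {})
    return all(k in args for k in required)

call = {"name": "get_weather", "arguments": {"city": "Paris"}}
print(validate_call(weather_tool, call))  # True
```

Note that a schema-valid call can still fail at execution time, which is why the paper treats real-world call success rates as a separate, server-dependent variable.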
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive datasets to evaluate MCP tool use
Diverse response formats increase MCP tool evaluation difficulty
Real-world MCP tool success rates vary across servers
LLMs' context window limits available tools per run
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized Model Context Protocol for LLMs
Large-scale multi-domain AI Agent benchmark
Evaluates single and multi-step tool calls
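For the multi-step setting, one simple way to score an agent's trajectory is to compare the sequence of tools it invoked against a gold sequence. The sketch below is illustrative only; MCPToolBench++'s actual metrics and gold-trajectory format are not specified here.

```python
def step_accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of gold steps where the agent picked the right tool,
    compared position by position. Extra or missing predicted steps
    count against the score via the zip truncation."""
    if not gold:
        return 0.0
    matched = sum(g == p for g, p in zip(gold, predicted))
    return matched / len(gold)

# A three-step task where the agent gets the first two tools right:
gold = ["search_web", "fetch_page", "summarize"]
pred = ["search_web", "fetch_page", "translate"]
print(step_accuracy(gold, pred))  # two of three steps match
```

Position-wise matching is the simplest choice; a benchmark may instead allow valid alternative orderings or judge final task success regardless of the path taken.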