🤖 AI Summary
Existing MCP (Model Context Protocol) benchmarks suffer from narrow coverage and oversimplified tasks, failing to reflect the complexity of real-world workflows.
Method: We propose MCPMark, the first real-world-oriented MCP benchmark framework, comprising 127 high-complexity tasks co-designed by domain experts and AI agents. These tasks require agents to perform deep create, read, update, and delete (CRUD) operations across heterogeneous environments. Programmatic verification scripts and preconfigured initial states enable comprehensive evaluation spanning both interaction breadth and depth. A lightweight agent framework integrates automated script validation, multi-turn interaction control, and CRUD simulation for fine-grained assessment of LLM-agent behavior.
Contribution/Results: Our benchmark establishes a new standard of rigor and practicality for MCP evaluation. Experiments show that the best-performing model, gpt-5-medium, achieves only 52.56% pass@1 and 33.86% pass^4, requiring on average 16.2 interaction turns and 17.4 tool calls per task, a substantial increase in difficulty and realism over prior benchmarks.
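The pass@1 and pass^4 figures above can be estimated from repeated runs per task. Assuming the usual reading (pass@k: at least one of k sampled runs succeeds, via the standard unbiased estimator; pass^k: all k sampled runs succeed), a minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    # Prob. that at least one of k runs sampled from n (c successes) passes.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n, c, k):
    # Stricter metric: prob. that all k sampled runs pass.
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Toy example: 4 runs per task, (n, c) = (total runs, successes).
runs = {"task_a": (4, 4), "task_b": (4, 2), "task_c": (4, 0)}
p1 = sum(pass_at_k(n, c, 1) for n, c in runs.values()) / len(runs)  # 0.5
p4 = sum(pass_pow_k(n, c, 4) for n, c in runs.values()) / len(runs)  # 1/3
```

Note how pass^k punishes inconsistency: task_b passes half its runs, which contributes 0.5 to pass@1 but nothing to pass^4, mirroring the large gap between the two numbers reported above.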
📝 Abstract
The Model Context Protocol (MCP) standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of $127$ high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only $52.56$% pass@1 and $33.86$% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below $30$% pass@1 and $15$% pass^4. On average, LLMs require $16.2$ execution turns and $17.4$ tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
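The "minimal agent framework that operates in a tool-calling loop" can be pictured as the sketch below. All names here (`llm.chat`, the message shapes, the tool registry) are illustrative assumptions, not the benchmark's actual API: the model is queried, any requested tool calls are executed against the environment, the results are fed back, and the loop repeats until the model stops calling tools.

```python
def run_task(llm, tools, task_prompt, max_turns=30):
    """Minimal tool-calling loop (hypothetical interfaces, for illustration).

    llm.chat(messages) returns an assistant message; if it contains
    "tool_calls", each named tool is invoked and its result is appended
    as a "tool" message before the next turn.
    """
    messages = [{"role": "user", "content": task_prompt}]
    turns = tool_calls = 0
    while turns < max_turns:
        turns += 1
        reply = llm.chat(messages)            # model decides: answer or call tools
        messages.append(reply)
        if not reply.get("tool_calls"):       # no tool requested -> final answer
            break
        for call in reply["tool_calls"]:      # execute each call, feed result back
            tool_calls += 1
            result = tools[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": str(result)})
    return messages, turns, tool_calls
```

Counting `turns` and `tool_calls` inside the loop is also how per-task statistics like the reported averages ($16.2$ turns, $17.4$ tool calls) would naturally be collected.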