🤖 AI Summary
Existing MCP (Model Context Protocol) benchmarks suffer from narrow coverage and oversimplified tasks, failing to reflect the complexity of real-world workflows.
Method: We propose MCPMark, the first real-world-oriented MCP benchmark framework, comprising 127 high-complexity tasks co-designed by domain experts and AI agents. These tasks require agents to perform deep create, read, update, and delete (CRUD) operations across heterogeneous environments. Programmatic verification scripts and preconfigured initial states enable comprehensive evaluation spanning both interaction breadth and depth. A lightweight agent framework integrates automated script validation, multi-turn interaction control, and CRUD simulation for fine-grained assessment of LLM-agent behavior.
Contribution/Results: Our benchmark establishes a new standard of rigor and practicality for MCP evaluation. Experiments show that the best-performing model, gpt-5-medium, achieves only 52.56% pass@1 and 33.86% pass^4, requiring on average 16.2 interaction turns and 17.4 tool calls per task, a substantial increase in difficulty and realism over prior benchmarks.
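The pass@1 and pass^4 figures above can be estimated from repeated runs per task. Assuming the usual reading (pass@k: at least one of k sampled runs succeeds, via the standard unbiased estimator; pass^k: all k sampled runs succeed), a minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    # Prob. that at least one of k runs sampled from n (c successes) passes.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_pow_k(n, c, k):
    # Stricter metric: prob. that all k sampled runs pass.
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Toy example: 4 runs per task, (n, c) = (total runs, successes).
runs = {"task_a": (4, 4), "task_b": (4, 2), "task_c": (4, 0)}
p1 = sum(pass_at_k(n, c, 1) for n, c in runs.values()) / len(runs)  # 0.5
p4 = sum(pass_pow_k(n, c, 4) for n, c in runs.values()) / len(runs)  # 1/3
```

Note how pass^k punishes inconsistency: task_b passes half its runs, which contributes 0.5 to pass@1 but nothing to pass^4, mirroring the large gap between the two numbers reported above.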
📝 Abstract
The Model Context Protocol (MCP) standardizes how LLMs interact with external systems, forming the foundation for general agents. However, existing MCP benchmarks remain narrow in scope: they focus on read-heavy tasks or tasks with limited interaction depth, and fail to capture the complexity and realism of real-world workflows. To address this gap, we propose MCPMark, a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of $127$ high-quality tasks collaboratively created by domain experts and AI agents. Each task begins with a curated initial state and includes a programmatic script for automatic verification. These tasks demand richer and more diverse interactions with the environment, involving a broad range of create, read, update, and delete (CRUD) operations. We conduct a comprehensive evaluation of cutting-edge LLMs using a minimal agent framework that operates in a tool-calling loop. Empirical results show that the best-performing model, gpt-5-medium, reaches only $52.56$% pass@1 and $33.86$% pass^4, while other widely regarded strong models, including claude-sonnet-4 and o3, fall below $30$% pass@1 and $15$% pass^4. On average, LLMs require $16.2$ execution turns and $17.4$ tool calls per task, significantly surpassing those in previous MCP benchmarks and highlighting the stress-testing nature of MCPMark.
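The "minimal agent framework that operates in a tool-calling loop" can be pictured as the sketch below. All names here (`llm.chat`, the message shapes, the tool registry) are illustrative assumptions, not the benchmark's actual API: the model is queried, any requested tool calls are executed against the environment, the results are fed back, and the loop repeats until the model stops calling tools.

```python
def run_task(llm, tools, task_prompt, max_turns=30):
    """Minimal tool-calling loop (hypothetical interfaces, for illustration).

    llm.chat(messages) returns an assistant message; if it contains
    "tool_calls", each named tool is invoked and its result is appended
    as a "tool" message before the next turn.
    """
    messages = [{"role": "user", "content": task_prompt}]
    turns = tool_calls = 0
    while turns < max_turns:
        turns += 1
        reply = llm.chat(messages)            # model decides: answer or call tools
        messages.append(reply)
        if not reply.get("tool_calls"):       # no tool requested -> final answer
            break
        for call in reply["tool_calls"]:      # execute each call, feed result back
            tool_calls += 1
            result = tools[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": str(result)})
    return messages, turns, tool_calls
```

Counting `turns` and `tool_calls` inside the loop is also how per-task statistics like the reported averages ($16.2$ turns, $17.4$ tool calls) would naturally be collected.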