🤖 AI Summary
Existing benchmarks lack a systematic way to evaluate how well AI agents orchestrate diverse MCP tools for multi-step tasks in dynamic, real-world scenarios.
Method: We introduce LiveMCP-101, a benchmark of 101 realistic, complex queries, and the first to assess agent-level multi-tool orchestration along actual execution paths. Departing from conventional API-response-based evaluation, we propose a fine-grained assessment framework driven by ground-truth execution trajectories, built through iterative LLM-based query rewriting and human verification. The queries require coordinated use of diverse tool categories (web search, file operations, mathematical reasoning, and data analysis), and the evaluation is complemented by in-depth error attribution analysis.
Contribution/Results: Experiments show that even state-of-the-art large language models achieve an overall success rate below 60%, exposing critical bottlenecks such as inconsistent tool scheduling and inefficient token usage. This work establishes a reproducible benchmark, introduces an evaluation paradigm grounded in real execution, and identifies concrete optimization directions for practical AI agent deployment.
📝 Abstract
Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how effectively AI agents can solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.