ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current large language model agents struggle with the challenges of real-world business software automation, such as inter-tool dependencies, environmental noise, and dynamic state. This work proposes ComplexMCP, a deterministic evaluation framework for interdependent toolchains and dynamic environments, built on the Model Context Protocol. The benchmark comprises seven stateful sandboxes and over 300 tools, and uses a seed-driven mechanism to simulate unpredictable API failures and environment evolution. Models are evaluated under both full-context and retrieval-augmented generation (RAG) paradigms, and their trajectories are analyzed at a fine-grained level. Experimental results show that even state-of-the-art models do not exceed a 60% success rate, well short of human performance at 90%. The study identifies three critical bottlenecks: tool retrieval saturation, over-confidence, and strategic defeatism. It thereby establishes an evaluation platform for next-generation robust autonomous systems.
📝 Abstract
Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce $\textbf{ComplexMCP}$, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), $\textbf{ComplexMCP}$ provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far behind human performance of 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) $\textbf{tool retrieval saturation}$ as action spaces scale; (2) $\textbf{over-confidence}$, where agents skip essential environment verifications; and (3) $\textbf{strategic defeatism}$, a tendency to rationalize failure rather than pursue recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning $\textbf{ComplexMCP}$ as a critical testbed for the next generation of resilient autonomous systems.
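To make the "deterministic yet diverse" idea concrete, here is a minimal sketch of how a seed could drive simulated API failures in a stateful sandbox. It is an illustration under stated assumptions and not ComplexMCP's actual implementation: the class `SeededSandbox`, the `deposit` tool, the `failure_rate` parameter, and the `run_episode` helper are all hypothetical names introduced here.

```python
import random


class SeededSandbox:
    """Hypothetical toy sandbox: whether a call fails depends only on
    (seed, tool name, call index), so a given seed always replays the
    same failure pattern while different seeds can yield different ones."""

    def __init__(self, seed: int, failure_rate: float = 0.3):
        self.seed = seed
        self.failure_rate = failure_rate  # assumed knob, not from the paper
        self.call_count = 0
        self.state = {"balance": 100}     # mutable environment state

    def _should_fail(self, tool_name: str) -> bool:
        # Seeding random.Random with a string is deterministic across runs.
        rng = random.Random(f"{self.seed}:{tool_name}:{self.call_count}")
        return rng.random() < self.failure_rate

    def call(self, tool_name: str, amount: int) -> dict:
        self.call_count += 1
        if self._should_fail(tool_name):
            # Simulated transient API failure; state is left untouched.
            return {"ok": False, "error": "transient_failure"}
        self.state["balance"] += amount
        return {"ok": True, "balance": self.state["balance"]}


def run_episode(seed: int, steps: int = 10) -> list:
    """Replay the same tool call several times and record success flags."""
    sandbox = SeededSandbox(seed)
    return [sandbox.call("deposit", 5)["ok"] for _ in range(steps)]


if __name__ == "__main__":
    # The same seed reproduces the same trace of successes and failures.
    print(run_episode(42) == run_episode(42))
```

Because failures are a pure function of the seed and the call history, an agent's run can be replayed exactly for trajectory analysis, while sweeping over seeds varies the conditions the agent must recover from.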
Problem

Research questions and friction points this paper is trying to address.

LLM agents
tool interdependence
dynamic environments
software automation
environmental noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

ComplexMCP
Model Context Protocol
tool interdependence
dynamic environment simulation
LLM agent evaluation