MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents

📅 2025-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing computer-use agent (CUA) benchmarks rely heavily on GUI-based interaction, which makes their evaluation fragile under UI changes and leaves out API-level interaction such as the Model Context Protocol (MCP). Method: We propose MCPWorld, the first unified evaluation benchmark supporting API-only, GUI-only, and API-GUI hybrid interaction, built upon modifiable “white-box” applications with full source-code access. Its white-box evaluation paradigm combines MCP support, dynamic code instrumentation, and containerized deployment, so that task completion can be verified programmatically and assessed automatically in all three interaction modes across 201 annotated tasks. Contribution/Results: Preliminary experiments with a representative LLM-powered CUA framework reach a task completion accuracy of 75.12%. All code and datasets are publicly released to foster reproducible research and community advancement.
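
As a quick sanity check on the reported figures, the short Python sketch below works out how many of the 201 tasks a 75.12% accuracy would correspond to. It assumes every task counts equally toward the accuracy, which the summary does not state; the implied task count is an inference, not a number reported by the authors.

```python
# Sanity check: which whole number of tasks matches the reported accuracy?
# Assumption (not stated in the source): accuracy = completed_tasks / total_tasks.
TOTAL_TASKS = 201          # from the summary
REPORTED_ACCURACY = 75.12  # percent, from the summary

# Nearest whole number of completed tasks implied by the percentage.
implied_completed = round(TOTAL_TASKS * REPORTED_ACCURACY / 100)

# Recompute the percentage from that count to see if it rounds back.
recomputed_accuracy = round(100 * implied_completed / TOTAL_TASKS, 2)

print(implied_completed, recomputed_accuracy)
```

Under that assumption, the recomputed percentage rounds back to the reported value, so the two figures are at least mutually consistent.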

📝 Abstract

(M)LLM-powered computer use agents (CUA) are emerging as a transformative technique to automate human-computer interaction. However, existing CUA benchmarks predominantly target GUI agents, whose evaluation methods are susceptible to UI changes and ignore function interactions exposed by application APIs, e.g., Model Context Protocol (MCP). To this end, we propose MCPWorld, the first automatic CUA testbed for API, GUI, and API-GUI hybrid agents. A key principle of MCPWorld is the use of "white-box apps", i.e., apps whose source code is available and can be revised and recompiled as needed (e.g., to add MCP support). This has two notable advantages: (1) It greatly broadens the design space of CUA, such as which app features are exposed or extracted as CUA-callable APIs, and how. (2) It allows MCPWorld to programmatically verify task completion by directly monitoring application behavior through techniques like dynamic code instrumentation, offering robust, accurate CUA evaluation decoupled from specific agent implementations or UI states. Currently, MCPWorld includes 201 well-curated and annotated user tasks, covering diverse use cases and difficulty levels. MCPWorld is also fully containerized with GPU acceleration support for flexible adoption on different OS/hardware environments. Our preliminary experiments, using a representative LLM-powered CUA framework, achieve 75.12% task completion accuracy, providing initial evidence of the practical effectiveness of agent automation leveraging MCP. Overall, we anticipate that MCPWorld will facilitate and standardize the benchmarking of next-generation computer use agents that can leverage rich external tools. Our code and dataset are publicly available at https://github.com/SAAgent/MCPWorld.
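
The idea of verifying task completion by monitoring application behavior, rather than inspecting the UI, can be illustrated with a minimal Python sketch. Everything here is hypothetical and is not MCPWorld's actual code: `NoteApp`, `save_note`, `instrument`, and `task_completed` are invented names, and simple method wrapping stands in for whatever instrumentation technique the testbed really uses.

```python
import functools


class NoteApp:
    """Hypothetical white-box app; a stand-in for a real application."""

    def __init__(self):
        self.notes = {}

    def save_note(self, title, body):
        # Internal function reached whether the agent clicks "Save" in the
        # GUI or calls an API/MCP tool that wraps the same code path.
        self.notes[title] = body
        return True


def instrument(cls, method_name, log):
    """Wrap cls.method_name so that every call is recorded in log."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        log.append({"method": method_name, "args": args, "result": result})
        return result

    setattr(cls, method_name, wrapper)
    return original  # returned so the caller can restore it later


def task_completed(log, method_name, expected_args):
    """Task check: was the method called with the expected arguments
    and did it report success?"""
    return any(
        entry["method"] == method_name
        and entry["args"] == expected_args
        and entry["result"]
        for entry in log
    )


call_log = []
instrument(NoteApp, "save_note", call_log)

app = NoteApp()
app.save_note("todo", "buy milk")  # stands in for the agent's action

print(task_completed(call_log, "save_note", ("todo", "buy milk")))
```

Because the check observes the app's internal call rather than pixels or widget trees, the same verifier applies to API, GUI, and hybrid agents, which is the decoupling the abstract describes.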

Problem

Research questions and friction points this paper is trying to address.

Lack of unified benchmarking for API, GUI, and hybrid computer use agents

Existing GUI agent benchmarks are fragile to UI changes

No robust evaluation method for API-exposed function interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified testbed for API, GUI, hybrid agents

White-box apps enable dynamic code instrumentation

Containerized with GPU acceleration support

🔎 Similar Papers

AgentStudio: A Toolkit for Building General Virtual Agents