MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents

📅 2025-06-09
🤖 AI Summary
Problem: Existing computer use agent (CUA) benchmarks rely too heavily on GUI-based interaction, which makes their evaluation vulnerable to UI changes and leaves API-level interaction, such as functions exposed through the Model Context Protocol (MCP), untested. Method: We propose MCPWorld, the first unified evaluation benchmark supporting API-only, GUI-only, and API-GUI hybrid interaction, built upon modifiable “white-box” applications with full source-code access. Its white-box evaluation paradigm combines MCP support, dynamic code instrumentation, and containerized deployment, so that task completion can be verified programmatically and assessed automatically across all three interaction modes on 201 annotated tasks. Contribution/Results: Preliminary experiments with a representative LLM-powered CUA framework reach a task completion accuracy of 75.12%. All code and datasets are publicly released to support reproducible research.

📝 Abstract
(M)LLM-powered computer use agents (CUA) are emerging as a transformative technique to automate human-computer interaction. However, existing CUA benchmarks predominantly target GUI agents, whose evaluation methods are susceptible to UI changes and ignore function interactions exposed by application APIs, e.g., Model Context Protocol (MCP). To this end, we propose MCPWorld, the first automatic CUA testbed for API, GUI, and API-GUI hybrid agents. A key principle of MCPWorld is the use of "white-box apps", i.e., apps whose source code is available and can be revised and recompiled as needed (e.g., to add MCP support), with two notable advantages: (1) It greatly broadens the design space of CUA, such as which app features are exposed as CUA-callable APIs and how. (2) It allows MCPWorld to programmatically verify task completion by directly monitoring application behavior through techniques like dynamic code instrumentation, offering robust, accurate CUA evaluation decoupled from specific agent implementations or UI states. Currently, MCPWorld includes 201 well-curated and annotated user tasks, covering diverse use cases and difficulty levels. MCPWorld is also fully containerized with GPU acceleration support for flexible adoption on different OS/hardware environments. Our preliminary experiments, using a representative LLM-powered CUA framework, achieve 75.12% task completion accuracy, providing initial evidence of the practical effectiveness of agent automation leveraging MCP. Overall, we anticipate MCPWorld to facilitate and standardize the benchmarking of next-generation computer use agents that can leverage rich external tools. Our code and dataset are publicly available at https://github.com/SAAgent/MCPWorld.
Problem

Research questions and friction points this paper is trying to address.

Lack of unified benchmarking for API, GUI, and hybrid computer use agents
Existing GUI agent benchmarks are fragile to UI changes
No robust evaluation method for API-exposed function interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified testbed for API, GUI, and hybrid agents
White-box apps enable dynamic code instrumentation
Containerized with GPU acceleration support
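
The white-box verification idea from the abstract can be illustrated with a minimal sketch. This is not MCPWorld's actual implementation: `NotesApp`, `Recorder`, and `task_completed` are hypothetical names invented for illustration, and runtime method wrapping stands in for whatever instrumentation technique the testbed really uses. The point it shows is that a hook placed on an internal application function observes task completion the same way whether the agent acted through the GUI or through an API.

```python
import functools
from dataclasses import dataclass, field


class NotesApp:
    """Hypothetical white-box app: GUI and API paths share one internal function."""

    def __init__(self) -> None:
        self.notes = {}

    def create_note(self, title: str, body: str) -> None:
        # Internal application logic that both interaction paths end up calling.
        self.notes[title] = body

    def on_save_button_clicked(self, title: str, body: str) -> None:
        # Path a GUI agent would trigger by clicking through the interface.
        self.create_note(title, body)

    def api_create_note(self, title: str, body: str) -> None:
        # Path an API agent would trigger through an exposed tool call.
        self.create_note(title, body)


@dataclass
class Recorder:
    """Collects calls to instrumented methods (illustrative, not MCPWorld's API)."""

    events: list = field(default_factory=list)

    def instrument(self, obj, method_name: str) -> None:
        original = getattr(obj, method_name)

        @functools.wraps(original)
        def wrapper(*args, **kwargs):
            # Record the call, then run the unmodified application code.
            self.events.append((method_name, args, kwargs))
            return original(*args, **kwargs)

        # Replace the method on this instance only, at runtime.
        setattr(obj, method_name, wrapper)


def task_completed(recorder: Recorder, expected_title: str) -> bool:
    """Task check: was a note with the expected title created, by any path?"""
    for name, args, kwargs in recorder.events:
        title = args[0] if args else kwargs.get("title")
        if name == "create_note" and title == expected_title:
            return True
    return False


if __name__ == "__main__":
    for path in ("on_save_button_clicked", "api_create_note"):
        app, rec = NotesApp(), Recorder()
        rec.instrument(app, "create_note")
        getattr(app, path)("todo", "buy milk")
        print(path, task_completed(rec, "todo"))
```

Because the check reads recorded application behavior instead of screen contents, it does not depend on widget layout or on which agent produced the action, which is the decoupling the abstract describes.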
Authors

Yunhe Yan
Beijing University of Posts and Telecommunications, Pengcheng Laboratory
Shihe Wang
Beijing University of Posts and Telecommunications
Jiajun Du
Beijing University of Posts and Telecommunications
Yexuan Yang
Beijing University of Posts and Telecommunications
Yuxuan Shan
Beijing University of Posts and Telecommunications
Qichen Qiu
Beijing University of Posts and Telecommunications
Xianqing Jia
Beijing University of Posts and Telecommunications
Xinge Wang
Beijing University of Posts and Telecommunications
Xin Yuan
Beijing University of Posts and Telecommunications
Xu Han
Beijing University of Posts and Telecommunications
Mao Qin
Beijing University of Posts and Telecommunications
Yinxiao Chen
Beijing University of Posts and Telecommunications
Chen Peng
Zhejiang University
Robotics, Path planning and control
Shangguang Wang
Beijing University of Posts and Telecommunications
Service Computing, Edge Computing, Satellite Computing
Mengwei Xu
Beijing University of Posts and Telecommunications