OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?

📅 2025-07-25
🤖 AI Summary
Existing benchmarks do not adequately capture the complexity of computer-using agents in terms of task heterogeneity, required capabilities, and alignment with user needs, which hinders capability-oriented development and real-world deployment. This paper introduces OS-MAP, a benchmark of 416 realistic daily tasks spanning 15 applications. It proposes a five-level automation taxonomy and a generalization hierarchy grounded in real user demand, enabling fine-grained evaluation along both breadth (cross-application coverage) and depth (automation granularity) through a performance-generalization evaluation matrix. Its tasks call for vision-language understanding, task decomposition, cross-application procedural reasoning, and intent modeling. Experiments show that even state-of-the-art agents built on vision-language models struggle with high-level coordination and with generalization across diverse tasks, pointing to autonomy and cross-scenario generalization as the main bottlenecks.

📝 Abstract
Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for internal task heterogeneity, the corresponding agent capabilities, and their alignment with actual user demands. This hinders both targeted capability development and the reliable transition of research progress into practical deployment. To bridge the gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: automation level, defined by a five-level taxonomy, and generalization scope, derived from a real-world user demand hierarchy. This design captures varying levels of required agent autonomy and generalization, and enables fine-grained analysis of required capabilities and alignment with real-world scenarios. Together the two dimensions form a performance-generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even state-of-the-art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination, highlighting the need for a deeper understanding of current strengths and limitations to drive future progress in computer-using agent research and deployment. All code, environments, baselines, and data are publicly available at https://github.com/OS-Copilot/OS-Map.
Problem

Research questions and friction points this paper is trying to address.

Assessing agent capabilities and alignment with user demands
Evaluating agent autonomy and generalization in real-world tasks
Identifying limitations in perception, reasoning, and coordination tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

OS-MAP benchmark for daily computer automation tasks
Five-level taxonomy for agent autonomy assessment
Performance-generalization matrix for structured evaluation
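
To make the performance-generalization matrix more concrete, the following is a minimal Python sketch of how per-task outcomes might be aggregated into one success rate per (automation level, generalization scope) cell. It is not the authors' implementation: the function name `build_matrix`, the record layout, the integer level labels 1 to 5, and the scope labels `scope_a` and `scope_b` are all hypothetical, and the records are toy data rather than results from the paper.

```python
from collections import defaultdict


def build_matrix(results):
    """Aggregate task outcomes into a (level, scope) -> success-rate mapping.

    `results` is an iterable of (automation_level, scope, succeeded) tuples.
    Levels are assumed to be integers 1..5, matching a five-level taxonomy;
    scope labels are arbitrary strings (hypothetical placeholders here).
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [successes, total tasks]
    for level, scope, succeeded in results:
        if level not in range(1, 6):
            raise ValueError(f"automation level must be 1..5, got {level}")
        cell = counts[(level, scope)]
        cell[0] += int(succeeded)
        cell[1] += 1
    # Convert raw counts into a success rate per matrix cell.
    return {key: wins / total for key, (wins, total) in counts.items()}


# Toy records for illustration only; these are NOT results from the paper.
toy_results = [
    (1, "scope_a", True),
    (1, "scope_a", True),
    (1, "scope_b", False),
    (3, "scope_a", True),
    (3, "scope_a", False),
    (5, "scope_b", False),
]

matrix = build_matrix(toy_results)
for (level, scope), rate in sorted(matrix.items()):
    print(f"level {level}, {scope}: {rate:.2f}")
```

Reading along one axis of such a table shows how success changes as the required autonomy increases, and reading along the other shows how it changes as the generalization scope widens.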
Authors
Xuetian Chen (Fudan University, Shanghai AI Lab)
Yinghao Chen (Shanghai AI Lab, Tsinghua University)
Xinfeng Yuan (Fudan University, Shanghai AI Lab)
Zhuo Peng (Fudan University)
Lu Chen (Fudan University)
Yuekeng Li (Fudan University)
Zhoujia Zhang (Fudan University)
Yingqian Huang (Fudan University)
Leyan Huang (Fudan University)
Jiaqing Liang (Fudan University); interests: knowledge graph, deep learning
Tianbao Xie (The University of Hong Kong); interests: Artificial Intelligence, Deep Learning, Natural Language Processing
Zhiyong Wu (The University of Hong Kong)
Qiushi Sun (The University of Hong Kong, National University of Singapore); interests: Natural Language Processing, Agents, Code Intelligence
Biqing Qi (Shanghai AI Lab)
Bowen Zhou (Shanghai AI Lab)