🤖 AI Summary
Existing benchmarks inadequately capture the complexity of computer-use agents across task heterogeneity, capability dimensions, and alignment with user needs, hindering capability-oriented development and real-world deployment. This paper introduces OS-MAP, a benchmark comprising 416 realistic daily tasks spanning 15 application categories. It proposes the first five-level automation taxonomy together with a demand-grounded generalization hierarchy, enabling fine-grained evaluation along both breadth (cross-application coverage) and depth (automation granularity) via a performance-generalization assessment matrix. OS-MAP exercises vision-language understanding, task decomposition, cross-application procedural reasoning, and intent modeling. Experiments reveal that even state-of-the-art vision-language-model-driven agents exhibit significant limitations in high-level coordination and generalization across diverse tasks, highlighting autonomy and cross-scenario generalization as fundamental bottlenecks.
📝 Abstract
Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for internal task heterogeneity, the corresponding agent capabilities, and their alignment with actual user demands, hindering both targeted capability development and the reliable transition of research progress into practical deployment. To bridge this gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: automation level, drawn from a five-level taxonomy, and generalization scope, derived from a real-world user demand hierarchy. This design captures the varying degrees of agent autonomy and generalization each task requires, forming a performance-generalization evaluation matrix that supports fine-grained analysis of required capabilities and structured, comprehensive assessment aligned with real-world scenarios. Experiments show that even state-of-the-art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination, highlighting the need for a deeper understanding of current strengths and limitations to drive future progress in computer-using agent research and deployment. All code, environments, baselines, and data are publicly available at https://github.com/OS-Copilot/OS-Map.
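The performance-generalization evaluation matrix described in the abstract can be pictured as a grid with automation level on one axis and generalization scope on the other, where each cell holds an agent's success rate on the tasks falling in that cell. The sketch below illustrates this aggregation scheme; the level numbers, scope labels, and episode records are hypothetical placeholders, not OS-MAP's actual taxonomy or data.

```python
from collections import defaultdict

# Hypothetical episode records: (automation_level, generalization_scope, success).
# Levels 1-5 and the scope labels are illustrative stand-ins for OS-MAP's
# five-level taxonomy and demand-hierarchy scopes.
episodes = [
    (1, "single-app", True),
    (1, "single-app", True),
    (2, "single-app", False),
    (3, "cross-app", True),
    (3, "cross-app", False),
    (5, "cross-app", False),
]

# Aggregate into a performance-generalization matrix:
# rows = automation level (depth), columns = generalization scope (breadth).
totals = defaultdict(int)
wins = defaultdict(int)
for level, scope, success in episodes:
    totals[(level, scope)] += 1
    wins[(level, scope)] += success  # True counts as 1

# Success rate per (level, scope) cell.
matrix = {cell: wins[cell] / totals[cell] for cell in totals}
print(matrix[(3, "cross-app")])  # 0.5
```

Reading the matrix row by row exposes where performance degrades as required autonomy rises; reading it column by column shows how well capabilities transfer across broader scopes.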