Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work targets the unreliability of current computer-use agents on complex, low-frequency (long-tail) interactions, which the authors attribute largely to the scarcity of training data for such interactions. They introduce CUActSpot, a benchmark that evaluates complex interactions across five modalities (GUI, text, table, canvas, and natural image) and a variety of actions such as click, drag, and draw. They also build a renderer-based data-synthesis pipeline that generates scenes for each modality, records screenshots and element coordinates, and uses an LLM to produce matching natural-language instructions and action traces. Trained on this corpus, Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters.

📝 Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
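To make the synthesis idea concrete, here is a minimal sketch of one pass through such a pipeline for the table modality. It is an illustration under stated assumptions, not the authors' implementation: the names `Element`, `render_table_scene`, and `synthesize_sample` are hypothetical, no screenshot is actually rendered, and a fixed string template stands in for the LLM step that writes the instruction.

```python
import random
from dataclasses import dataclass


@dataclass
class Element:
    """A scene element with its pixel bounding box (x0, y0, x1, y1)."""
    name: str
    box: tuple


def render_table_scene(rows, cols, cell_w=120, cell_h=40):
    """Lay out a grid 'table' and record every cell's bounding box.

    Hypothetical stand-in for a renderer: a real pipeline would also
    draw the scene and save a screenshot.
    """
    elements = []
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * cell_w, r * cell_h
            box = (x0, y0, x0 + cell_w, y0 + cell_h)
            elements.append(Element(f"cell_r{r}_c{c}", box))
    return elements


def center(box):
    """Return the integer center point of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) // 2, (y0 + y1) // 2)


def synthesize_sample(elements, seed=0):
    """Pick two distinct cells and emit a drag instruction plus its trace.

    The template below is an assumption replacing the LLM-written
    instruction described in the abstract.
    """
    rng = random.Random(seed)
    src, dst = rng.sample(elements, 2)
    instruction = f"Drag {src.name} onto {dst.name}."
    trace = [
        {"action": "mouse_down", "at": center(src.box)},
        {"action": "mouse_move", "at": center(dst.box)},
        {"action": "mouse_up", "at": center(dst.box)},
    ]
    return {"instruction": instruction, "trace": trace}


if __name__ == "__main__":
    scene = render_table_scene(rows=2, cols=3)
    print(synthesize_sample(scene, seed=1))
```

Because the coordinates come from the layout code itself, the action trace is exact by construction, which is the property that makes renderer-based synthesis attractive compared with annotating real screenshots.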

Problem

Research questions and friction points this paper is trying to address.

computer-use agents

complex GUI interactions

data scarcity

long-tail distribution

interaction benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

data synthesis

benchmark

computer-use agents

multimodal interaction

rendering pipeline
