Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the unreliability of current computer-use agents on complex, low-frequency human-computer interactions, which the authors hypothesize stems largely from the scarcity of training data for such interactions. To bridge this gap, they introduce CUActSpot, an evaluation benchmark spanning five modalities (GUIs, text, tables, canvases, and natural images) and a variety of actions such as click, drag, and draw. They also develop a renderer-based synthetic data pipeline that automatically generates scenes, records screenshots and element coordinates, and uses an LLM to produce matching natural-language instructions and action traces. Trained on this data, Phi-Ground-Any-4B outperforms open-source models with fewer than 32 billion parameters.
📝 Abstract
Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark, CUActSpot, for evaluating models' capabilities on complex interactions across five modalities (GUI, text, table, canvas, and natural image) and a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
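To make the renderer-based idea concrete, here is a minimal sketch for a single modality (a table). It is not the authors' pipeline: every name in it (`Element`, `render_table_scene`, `make_sample`) is hypothetical, the screenshot step is omitted, and a fixed template stands in for the LLM that the abstract says writes the instructions. The point it illustrates is that when the generator places every element itself, exact coordinates for the action trace come for free.

```python
import random
from dataclasses import dataclass


@dataclass
class Element:
    """One rendered element and its pixel bounding box (x0, y0, x1, y1)."""
    label: str
    box: tuple

    def center(self):
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) // 2, (y0 + y1) // 2)


def render_table_scene(rows, cols, cell_w=120, cell_h=40):
    """Lay out a toy table and record each cell's coordinates.

    A real renderer would also save a screenshot; this sketch keeps only
    the geometry, which is known exactly because we placed every cell.
    """
    elements = []
    for r in range(rows):
        for c in range(cols):
            box = (c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h)
            elements.append(Element(label=f"row {r}, column {c}", box=box))
    return elements


def make_sample(elements, action, rng):
    """Pair an instruction with an action trace for one scene.

    Assumption: the instruction is a template here, standing in for the
    LLM-written instruction described in the abstract.
    """
    if action == "click":
        target = rng.choice(elements)
        return {
            "instruction": f"Click the cell at {target.label}.",
            "trace": [{"type": "click", "point": target.center()}],
        }
    if action == "drag":
        src, dst = rng.sample(elements, 2)  # two distinct cells
        return {
            "instruction": f"Drag from {src.label} to {dst.label}.",
            "trace": [{"type": "drag", "start": src.center(), "end": dst.center()}],
        }
    raise ValueError(f"unsupported action: {action}")


scene = render_table_scene(rows=2, cols=3)
rng = random.Random(1)  # fixed seed so the synthetic samples are reproducible
samples = [make_sample(scene, action, rng) for action in ("click", "drag")]
```

Each entry in `samples` holds an instruction string and a list of action steps with pixel coordinates. Other modalities and actions (for example, drawing on a canvas) would need their own scene generators and trace formats, which this sketch does not attempt.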
Problem

Research questions and friction points this paper is trying to address.

computer-use agents
complex GUI interactions
data scarcity
long-tail distribution
interaction benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

data synthesis
benchmark
computer-use agents
multimodal interaction
rendering pipeline