Grounding Computer Use Agents on Human Demonstrations

📅 2025-11-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of high-quality instruction–UI element grounding data for desktop environments, this work introduces GroundCUA, the first large-scale desktop interaction grounding dataset, comprising 56K screenshots from 87 real-world applications and over 3.56 million human-verified annotations. Building on this dataset, the authors develop the GroundNext family of models (at 3B and 7B scales), trained with supervised fine-tuning and further improved by reinforcement learning post-training, to localize the UI element targeted by a natural language instruction. The approach drastically reduces data dependency: using less than one-tenth of the training data of prior work, GroundNext achieves state-of-the-art results across five benchmarks, and in an agentic evaluation on OSWorld with o3 as planner it matches or exceeds models trained on substantially more data, empirically validating high-quality, expert-driven demonstrations for desktop grounding.

📝 Abstract
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.
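The grounding task described above, mapping an instruction to a click on the correct on-screen element, is commonly scored by checking whether the model's predicted click point falls inside the annotated element's bounding box. A minimal sketch of that metric (function names and the sample data are illustrative, not taken from the paper):

```python
def point_in_box(point, box):
    """Return True if a predicted click point lands inside a target bounding box.

    point: (x, y) pixel coordinates; box: (x_min, y_min, x_max, y_max).
    """
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

def grounding_accuracy(predictions, targets):
    """Fraction of instructions whose predicted point hits the annotated element."""
    hits = sum(point_in_box(p, t) for p, t in zip(predictions, targets))
    return hits / len(predictions)

# Example: two of three predicted clicks fall inside their target elements.
preds = [(120, 45), (300, 200), (10, 10)]
boxes = [(100, 30, 150, 60), (280, 190, 320, 210), (500, 500, 550, 550)]
print(grounding_accuracy(preds, boxes))  # → 0.666...
```

Point-in-box accuracy is the standard headline number on grounding benchmarks such as those referenced here; individual benchmarks may differ in details (e.g., center-point vs. any-point tolerance).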
Problem

Research questions and friction points this paper is trying to address.

Addressing limited desktop grounding datasets for computer agents
Developing models to map instructions to UI elements accurately
Advancing general-purpose computer agents through expert-driven data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created large-scale desktop grounding dataset from expert demonstrations
Developed GroundNext models mapping instructions to UI elements
Achieved state-of-the-art results with minimal training data
Aarash Feizi
PhD student in Computer Science, McGill University
Representation Learning, Self-Supervised Learning, Graph Representation Learning
Shravan Nayak
Mila
Vision and Language, Culture, Geo-diversity, Multilinguality
Xiangru Jian
University of Waterloo
Multimodality, LLM, GNN, Database
Kevin Qinghong Lin
University of Oxford; National U. of Singapore
Vision and Language, Video Understanding, AI Agent
Kaixin Li
National University of Singapore
Machine Learning, Natural Language Processing, Code Intelligence, GUI Agents
Rabiul Awal
Mila, Montreal
deep learning, agi
Xing Han Lu
Mila - Quebec AI Institute, McGill University
Johan Obando-Ceron
Mila, University of Montreal
Deep Learning, Reinforcement Learning, Machine Learning, Artificial Intelligence
Juan A. Rodriguez
Mila - Quebec AI Institute, ETS, ServiceNow Research, ILLS
Artificial Intelligence, Deep Learning, Computer Vision, Multimodal AI, Scalable Vector Graphics
Nicolas Chapados
ServiceNow Research, Mila, Polytechnique Montréal (adjunct)
Deep Learning, Artificial Intelligence, Statistics, Forecasting
David Vázquez
ServiceNow Research, ELLIS member
Artificial Intelligence, Computer Vision, Multimodal Learning, Machine Learning
Adriana Romero-Soriano
Fundamental AI Research, Meta
deep learning, machine learning, AI
Reihaneh Rabbany
Assistant Professor of Computer Science, McGill University; Canada CIFAR AI Chair, Mila
Data Mining, Machine Learning, Graph Mining, Network Science, Computational Social Science
Perouz Taslakian
Mila - Quebec AI Institute, McGill University, ServiceNow Research
Christopher Pal
Mila - Quebec AI Institute, McGill University, Polytechnique Montréal, ServiceNow Research
Spandana Gella
ServiceNow AI Research
Multimodal Foundational Models, GUI Agents, Safety & Security
Sai Rajeswar
Staff Research Scientist, Adjunct Professor, Mila, ServiceNow
machine learning, generative models, reinforcement learning