🤖 AI Summary
GUI instruction grounding, the task of mapping natural language commands to pixel-level interface actions, faces fundamental bottlenecks: lacking software commonsense knowledge, insufficient UI layout understanding, and difficulty executing fine-grained operations. To address these, we introduce OSWorld-G, a fine-grained GUI grounding benchmark of 564 finely annotated samples, together with Jedi, the largest computer-use grounding dataset with 4 million examples synthesized via multi-perspective task decomposition. Multi-scale models trained on Jedi jointly capture software commonsense and layout semantics, and ablations show that combining specialized data for different interface elements enables compositional generalization to novel interfaces. Crucially, we empirically demonstrate that improved grounding directly enhances the agentic capabilities of general-purpose foundation models, lifting the OSWorld task success rate from 5% to 27%. Our Jedi-trained models achieve state-of-the-art results on ScreenSpot-v2, ScreenSpot-Pro, and OSWorld-G. All benchmark data, models, and code are publicly released.
📝 Abstract
Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types, including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer-use grounding dataset, Jedi, which contains 4 million examples generated through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances the agentic capabilities of general foundation models on complex computer tasks, improving success rates from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. The benchmark, data, checkpoints, and code are all open-sourced and available at https://osworld-grounding.github.io.