🤖 AI Summary
Existing UI datasets struggle to balance scale and fine-grained functional annotation: large-scale datasets lack context-aware functional labels, while high-quality functional descriptions are limited to small samples. To address this, we propose an LLM-driven paradigm for automatic GUI element functional annotation, which models UI state transitions through simulated interactions and integrates LLM-based reasoning, self-verification, and rejection sampling—enabling high-precision, fine-grained annotation without human intervention. We introduce and publicly release AutoGUI-704k, the first large-scale, multi-device, multi-resolution GUI functional dataset. Experiments demonstrate that its annotation quality matches human performance; it significantly improves vision-language model (VLM) capabilities on UI referring tasks and achieves state-of-the-art results across multiple GUI understanding benchmarks.
📝 Abstract
User interface understanding with vision-language models has received much attention due to its potential for enabling next-generation software automation. However, existing UI datasets either provide only large-scale, context-free element annotations or contextualized functional descriptions for elements at a much smaller scale. In this work, we propose the AutoGUI pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing UI content changes before and after simulated interactions with specific UI elements. To improve annotation quality, we propose LLM-aided rejection and verification, eliminating invalid and incorrect annotations without human labor. We construct the AutoGUI-704k dataset using the proposed pipeline, featuring multi-resolution, multi-device screenshots, diverse data domains, and detailed functionality annotations that no previous dataset provides. Human evaluation shows that the AutoGUI pipeline achieves annotation correctness comparable to trained human annotators. Extensive experimental results show that our AutoGUI-704k dataset remarkably enhances VLMs' UI grounding capabilities, exhibits significant scaling effects, and outperforms existing web pre-training data types. We envision AutoGUI as a scalable pipeline for generating massive data to build GUI-oriented VLMs. The AutoGUI dataset can be viewed at this anonymous URL: https://autogui-project.github.io/.
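The annotation loop the abstract describes (simulate an interaction, diff the UI state, have an LLM infer the element's function, then verify or reject the draw) can be sketched as follows. All names, data structures, and the LLM/verifier interfaces here are illustrative assumptions, not the authors' actual implementation:

```python
# Hypothetical sketch of an AutoGUI-style annotation loop. The Sample fields,
# prompt wording, and callable interfaces are assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Sample:
    element: str   # identifier of the interacted UI element
    before: str    # UI content (e.g., accessibility tree) before interaction
    after: str     # UI content after the simulated interaction

def infer_function(llm: Callable[[str], str], sample: Sample) -> str:
    # Ask the LLM to describe the element's function from the state change.
    prompt = (
        "Given the UI content before and after interacting with an element, "
        "describe that element's function in one sentence.\n"
        f"BEFORE:\n{sample.before}\nAFTER:\n{sample.after}\n"
    )
    return llm(prompt)

def annotate(llm: Callable[[str], str],
             verifier: Callable[[Sample, str], bool],
             sample: Sample,
             max_tries: int = 3) -> Optional[str]:
    # Rejection sampling: draw annotations until the LLM-aided verifier
    # accepts one; discard the sample if every draw is rejected.
    for _ in range(max_tries):
        label = infer_function(llm, sample)
        if verifier(sample, label):
            return label
    return None
```

With a stub LLM and verifier plugged in, `annotate` returns the first accepted functionality description, or `None` when every candidate is rejected, which mirrors how invalid annotations are eliminated without human labor.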