🤖 AI Summary
Scarce fine-grained vision-language data for desktop GUIs is a critical bottleneck for GUI agent development. To address it, this paper introduces DeskVision, the first large-scale region-level image captioning dataset for desktop GUIs, and its companion evaluation benchmark, DeskVision-Eval. We propose AutoCaptioner, an automated annotation pipeline that combines CLIP and OCR to generate high-quality pseudo-labels, enabling end-to-end GUI element localization and description. We also design GUIExplorer, a lightweight and efficient model that achieves state-of-the-art performance without relying on complex architectures. Experiments show that training on DeskVision significantly improves the GUI understanding accuracy and cross-system generalization of mainstream large vision-language models (LVLMs) across diverse operating systems and UI paradigms. DeskVision-Eval is the first fine-grained GUI evaluation benchmark grounded in real-world desktop environments, covering realistic interaction scenarios, layout variations, and multi-application contexts.
📝 Abstract
The scarcity of graphical user interface (GUI) data has been a significant barrier to the development of GUI agents, especially for desktop/computer-use scenarios. To address this, we propose AutoCaptioner, an automated GUI data generation pipeline that produces data with rich descriptions while minimizing human effort. Using AutoCaptioner, we create DeskVision, a novel large-scale desktop GUI dataset, along with DeskVision-Eval, the largest desktop test benchmark, which reflects daily usage and covers diverse systems and UI elements, each with rich descriptions. With DeskVision, we train a new GUI understanding model, GUIExplorer. Results show that GUIExplorer achieves state-of-the-art (SOTA) performance in understanding and grounding visual elements without complex architectural designs. We further validate the effectiveness of DeskVision through ablation studies on various large vision-language models (LVLMs). We believe that AutoCaptioner and DeskVision will significantly advance the development of GUI agents, and we will open-source them for the community.