🤖 AI Summary
Existing evaluation frameworks lack standardized, fine-grained benchmarks for assessing autonomous agents’ task automation capabilities in desktop GUI environments.
Method: We introduce UI-Vision, the first open-source, permissively licensed, fine-grained benchmark for desktop GUI understanding and interaction. It covers 83 real-world desktop applications and features a densely annotated multimodal dataset derived from authentic screen captures and human demonstrations, including bounding boxes of UI elements, semantic labels, and action sequences. We propose three hierarchical evaluation tasks: Element Grounding, Layout Grounding, and Action Prediction, enabling offline, reproducible, fine-grained assessment.
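To make the grounding tasks concrete, here is a minimal sketch of how offline element-grounding evaluation is commonly scored: a prediction counts as correct if the predicted click point falls inside the ground-truth bounding box of the target element. This is a standard convention for GUI grounding benchmarks, not necessarily UI-Vision's exact metric, and the field layout shown is illustrative rather than the benchmark's actual schema.

```python
# Hedged sketch of point-in-box element-grounding scoring.
# bbox and point formats here are assumptions, not UI-Vision's real schema.

def point_in_box(point, bbox):
    """Return True if point (x, y) lies inside bbox (x1, y1, x2, y2), in pixels."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(predicted_points, ground_truth_boxes):
    """Fraction of predicted click points landing inside their target element's box."""
    if not predicted_points:
        return 0.0
    hits = sum(
        point_in_box(p, b)
        for p, b in zip(predicted_points, ground_truth_boxes)
    )
    return hits / len(predicted_points)
```

Under this convention, a model can be evaluated fully offline against recorded annotations, which is what makes the benchmark reproducible without a live desktop environment.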
Results: Experiments reveal significant limitations in current large models (e.g., UI-TARS-72B) in professional software comprehension, spatial-relation reasoning, and complex interactions (e.g., drag-and-drop). UI-Vision fills a critical gap in offline evaluation for desktop GUI agents and provides a quantifiable, diagnostic benchmark to advance research on computer-use agents.
📝 Abstract
Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer-use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse-grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer-use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.