GUI-360$^\circ$: A Comprehensive Dataset and Benchmark for Computer-Using Agents

📅 2025-11-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

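Computer-using agents (CUAs) face three problems: a scarcity of real-world tasks, no automated annotation for multimodal trajectories, and no unified evaluation benchmark. To address them, the paper proposes the GUI-360° dataset. An LLM-augmented automated pipeline builds trajectory data for office applications containing 1.2 million action steps, and the dataset supports evaluation of GUI grounding, screen parsing, and action prediction.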

📝 Abstract

We introduce GUI-360$^\circ$, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges, and progress on them is constrained by three persistent gaps: a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360$^\circ$ addresses these gaps with an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications, and includes full-resolution screenshots, accessibility metadata when available, instantiated goals, intermediate reasoning traces, and both successful and failed action trajectories. The dataset supports three canonical tasks (GUI grounding, screen parsing, and action prediction) and a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision-language models on GUI-360$^\circ$ reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. We release GUI-360$^\circ$ and accompanying code to facilitate reproducible research and accelerate progress on robust desktop CUAs. The full dataset is publicly available at https://huggingface.co/datasets/vyokky/GUI-360.
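
The hybrid GUI+API action space mentioned in the abstract could be modeled roughly as below. This is only an illustrative sketch: the class names, field names, and the `set_font_size` call are hypothetical and are not taken from the released dataset schema or code.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class GuiAction:
    """A pointer or keyboard action at screen coordinates (hypothetical schema)."""
    kind: str      # e.g. "click" or "type"
    x: int         # horizontal pixel coordinate
    y: int         # vertical pixel coordinate
    text: str = ""  # text to enter, if any


@dataclass
class ApiAction:
    """A direct application call that bypasses the GUI (hypothetical schema)."""
    function: str
    arguments: dict = field(default_factory=dict)


# A hybrid action space lets each step be either kind of action.
Action = Union[GuiAction, ApiAction]


def describe(action: Action) -> str:
    """Render one step of a trajectory as a readable string."""
    if isinstance(action, GuiAction):
        suffix = f" text={action.text!r}" if action.text else ""
        return f"GUI {action.kind} at ({action.x}, {action.y}){suffix}"
    args = ", ".join(f"{k}={v!r}" for k, v in sorted(action.arguments.items()))
    return f"API {action.function}({args})"


if __name__ == "__main__":
    # Illustrative two-step trajectory mixing both action types.
    trajectory = [
        GuiAction("click", 120, 48),
        ApiAction("set_font_size", {"size": 14}),
    ]
    for step in trajectory:
        print(describe(step))
```

Treating both action types as one union means a single trajectory can mix a click with a direct application call, which is the property the abstract attributes to modern agent designs.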

Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of real-world computer-using agent tasks

Automating multimodal trajectory collection and annotation pipelines

Creating unified benchmark for GUI grounding and action prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-augmented automated pipeline for data collection

Hybrid GUI and API action space for agents

Benchmark suite evaluating grounding and prediction
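
One common way to score GUI grounding is to check whether a predicted click point falls inside the target element's bounding box. The sketch below assumes that convention; this page does not state which metric the benchmark actually uses, and the function names and box format here are illustrative.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed format


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(predictions)


if __name__ == "__main__":
    # Two predictions against the same target box: one hit, one miss.
    preds = [(10.0, 10.0), (50.0, 50.0)]
    boxes = [(0.0, 0.0, 20.0, 20.0), (0.0, 0.0, 20.0, 20.0)]
    print(grounding_accuracy(preds, boxes))
```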

🔎 Similar Papers

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

2024-10-07 | arXiv.org | Citations: 17

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

2024-06-16 | Citations: 20

AgentStudio: A Toolkit for Building General Virtual Agents

2024-03-26 | arXiv.org | Citations: 8