AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

📅 2026-04-27
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing GUI agent benchmarks struggle to evaluate deep understanding of implicit functional logic and post-interaction state changes in graphical user interfaces. To address this limitation, this work introduces AutoGUI-v2, a multimodal benchmark that pioneers a recursive annotation pipeline combining vision-language models (VLMs) with human annotators to construct hierarchical functional regions from screenshots across multiple platforms. The benchmark systematically evaluates agents' capabilities in semantic comprehension, element localization, and dynamic state prediction, encompassing six operating systems and 2,753 tasks. Experimental results reveal that open-source fine-tuned models excel at localization, while commercial models demonstrate stronger descriptive abilities; however, all models exhibit significant weaknesses in handling complex or infrequent interaction logic, underscoring that deep functional understanding remains a fundamental challenge.

📝 Abstract
Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
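The abstract describes recursively parsing screenshots into hierarchical functional regions. As a rough illustration only, that idea can be pictured as a tree of regions expanded top-down. The sketch below is not the authors' pipeline: the `Region` class, the `parse_recursively` and `leaves` helpers, and the `toy_proposer` splitting rule are all hypothetical names introduced here, and the proposer merely stands in for the VLM-plus-human step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Region:
    """One node of a hypothetical functional-region tree."""
    bbox: BBox
    function: str  # free-text description of what the region does
    children: List["Region"] = field(default_factory=list)


# A proposer maps a region to its sub-regions. In a real pipeline this would be
# a VLM call followed by human verification; here it is just a callable.
Proposer = Callable[[Region], List[Region]]


def parse_recursively(region: Region, propose: Proposer, max_depth: int = 3) -> Region:
    """Expand `region` in place until the proposer returns nothing or depth runs out."""
    if max_depth == 0:
        return region
    region.children = [
        parse_recursively(child, propose, max_depth - 1) for child in propose(region)
    ]
    return region


def leaves(region: Region) -> List[Region]:
    """Collect leaf nodes, which play the role of individual elements."""
    if not region.children:
        return [region]
    return [leaf for child in region.children for leaf in leaves(child)]


def toy_proposer(region: Region) -> List[Region]:
    """Illustrative stand-in: split any region wider than 100 px into two halves."""
    left, top, right, bottom = region.bbox
    if right - left <= 100:
        return []
    mid = (left + right) // 2
    return [
        Region((left, top, mid, bottom), f"left part of: {region.function}"),
        Region((mid, top, right, bottom), f"right part of: {region.function}"),
    ]


screen = Region((0, 0, 400, 50), "top toolbar")
tree = parse_recursively(screen, toy_proposer, max_depth=3)
elements = leaves(tree)
```

In this toy setup the intermediate nodes correspond to region-level items and the leaves to element-level items; how the benchmark actually derives its tasks from the hierarchy is not specified in the abstract.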
Problem

Research questions and friction points this paper is trying to address.

GUI understanding
functionality comprehension
interaction outcome prediction
digital autonomy
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI functionality understanding
interaction outcome prediction
VLM-human collaborative pipeline
hierarchical functional grounding
multi-modal benchmark
Authors
Hongxin Li
University of Chinese Academy of Sciences (UCAS); New Laboratory of Pattern Recognition (NLPR), CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA
Xiping Wang
University of Chinese Academy of Sciences (UCAS); New Laboratory of Pattern Recognition (NLPR), CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA
Jingran Su
The Hong Kong Polytechnic University
Embodied AI
Zheng Ju
University of Chinese Academy of Sciences (UCAS); New Laboratory of Pattern Recognition (NLPR), CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA
Yuntao Chen
Miromind
agentic AI, multimodal model, computer vision
Qing Li
Chair Professor (Data Science), the Hong Kong Polytechnic University
database, data warehouse, multimedia retrieval, web services, e-learning
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Biologically-inspired Learning