AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing GUI agent benchmarks struggle to evaluate deep understanding of implicit functional logic and post-interaction state changes in graphical user interfaces. To address this limitation, this work introduces AutoGUI-v2, a multimodal benchmark built with a recursive annotation pipeline in which vision-language models (VLMs) and human annotators jointly parse multi-platform screenshots into hierarchical functional regions. The benchmark comprises 2,753 tasks across six operating systems and systematically evaluates agents’ semantic comprehension, element localization, and dynamic state prediction. Experimental results show that open-source fine-tuned models excel at localization, while commercial models produce stronger functionality descriptions; however, all models are markedly weak on complex or infrequent interaction logic, indicating that deep functional understanding remains a fundamental challenge.

📝 Abstract

Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative pipeline that recursively parses multi-platform screenshots into hierarchical functional regions to generate diverse evaluation tasks. Providing 2,753 tasks across six operating systems, AutoGUI-v2 rigorously tests agents on region- and element-level semantics, grounding, and dynamic state prediction. Our evaluation reveals a striking dichotomy in VLMs: while open-source models fine-tuned on agent data (e.g., Qwen3-VL) excel at functional grounding, commercial models (e.g., Gemini-2.5-Pro-Thinking) dominate in functionality captioning. Crucially, all models struggle with the complex interaction logic of uncommon actions, highlighting that deep functional understanding remains a significant hurdle. By systematically measuring these foundational capabilities, AutoGUI-v2 offers a new lens for advancing the next generation of GUI agents.
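
As a rough illustration of two ideas mentioned in the abstract, hierarchical functional regions and functional grounding, the sketch below models a screenshot as a tree of regions and scores predicted click points against target bounding boxes. Everything here is an assumption made for illustration: the `Region` class, its fields, the `(left, top, right, bottom)` box convention, and the point-in-box scoring rule are hypothetical and are not taken from AutoGUI-v2's actual data format or metrics.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Sequence, Tuple

# Assumed box convention: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


@dataclass
class Region:
    """Hypothetical node in a hierarchical functional-region tree."""
    name: str
    box: Box
    functionality: str = ""
    children: List["Region"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Region"]]:
        """Yield (depth, region) pairs in pre-order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def contains(box: Box, point: Point) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    left, top, right, bottom = box
    x, y = point
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(targets: Sequence[Region], predictions: Sequence[Point]) -> float:
    """Fraction of predicted points that fall inside their target region's box."""
    if not targets:
        return 0.0
    hits = sum(contains(t.box, p) for t, p in zip(targets, predictions))
    return hits / len(targets)


# Toy screen: a toolbar region that contains two elements.
save = Region("save_button", (10, 10, 50, 40), "Writes the document to disk")
search = Region("search_box", (100, 10, 300, 40), "Filters the visible items")
toolbar = Region("toolbar", (0, 0, 800, 50), "Global actions", [save, search])
screen = Region("screen", (0, 0, 800, 600), "Editor window", [toolbar])

for depth, region in screen.walk():
    print("  " * depth + region.name)

# First prediction lands on the save button, second misses the search box.
print(grounding_accuracy([save, search], [(30, 20), (500, 500)]))  # prints 0.5
```

A tree is used here because the abstract describes regions parsed recursively, so a region may itself contain regions or elements; a flat list of boxes would lose that nesting.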

Problem

Research questions and friction points this paper is trying to address.

GUI understanding

functionality comprehension

interaction outcome prediction

digital autonomy

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI functionality understanding

interaction outcome prediction

VLM-human collaborative pipeline