AUTO-Explorer: Automated Data Collection for GUI Agent

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GUI exploration methods suffer from poor generalization due to scarce, platform-specific training data—particularly for desktop applications and emerging web interfaces. Method: This paper introduces Auto-Explorer, an autonomous GUI exploration framework, and UIXplore, a standardized benchmark. Auto-Explorer integrates multimodal large language models (MLLMs) with screenshot understanding and fine-grained UI element parsing to enable zero-shot, human-free traversal of unseen GUI environments, generating high-quality structured interaction traces at low annotation cost. Contribution/Results: (1) It establishes the first unified cross-platform (web/desktop) exploration paradigm; (2) it improves exploration efficiency via dynamic action-space modeling and feedback-driven policy adaptation; (3) UIXplore provides a reproducible, quantitative evaluation protocol for exploration quality. Experiments demonstrate significant gains in MLLM task completion rates and out-of-distribution generalization across previously unseen software interfaces.

📝 Abstract
Recent advances in GUI agents have significantly expanded their ability to interpret natural language commands and operate software interfaces. However, acquiring GUI training data remains a major challenge. Existing methods typically deploy automated agents that browse URLs from Common Crawl, using webpage HTML to collect screenshots and corresponding annotations, such as the names and bounding boxes of UI elements. This approach, however, is difficult to apply to desktop software or to newly launched websites not yet included in Common Crawl. While strong generalization could in principle compensate, rapid and reliable adaptation to new software or websites remains crucial in personalized scenarios. To address this, we propose Auto-Explorer, an automated data collection method with minimal annotation cost. It incorporates a simple yet effective exploration mechanism that autonomously parses and explores GUI environments, gathering data efficiently. To assess exploration quality, we also develop the UIXplore benchmark, which creates environments in which explorer agents discover and save software states. Using the collected data, we fine-tune a multimodal large language model (MLLM) and build a GUI element grounding test set to evaluate the effectiveness of exploration strategies. Our experiments demonstrate the superior performance of Auto-Explorer, showing that it can quickly enhance an MLLM's capabilities on explored software.
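The abstract describes an agent that autonomously parses a GUI, clicks elements, and saves screenshots with name/bounding-box annotations. The paper's actual algorithm and interfaces are not given on this page; the following is a minimal Python sketch of what such an exploration loop might look like, with every class and method name (`UIElement`, `env.observe`, `env.click`, `FakeEnv`) being a hypothetical stand-in:

```python
import json
from dataclasses import dataclass, field

# Illustrative sketch only; the paper's real data format and environment
# API are not specified on this page.

@dataclass
class UIElement:
    name: str
    bbox: tuple  # (x1, y1, x2, y2) in screen pixels

@dataclass
class Trace:
    records: list = field(default_factory=list)

    def save(self, state_id, elements, action):
        # Each record pairs a screenshot reference with its element
        # annotations, mirroring the screenshot + name/bbox annotation
        # format described in the abstract.
        self.records.append({
            "state": state_id,
            "action": action,
            "elements": [{"name": e.name, "bbox": e.bbox} for e in elements],
        })

def explore(env, max_steps=10):
    """Greedy exploration: repeatedly parse the current screen and click
    an element not yet tried in this state (hypothetical env API)."""
    trace, tried = Trace(), set()
    for _ in range(max_steps):
        state_id, elements = env.observe()  # screenshot id + parsed UI elements
        fresh = [e for e in elements if (state_id, e.name) not in tried]
        if not fresh:
            break  # nothing new to try here
        target = fresh[0]
        tried.add((state_id, target.name))
        trace.save(state_id, elements, f"click:{target.name}")
        env.click(target)  # may transition to a new software state

    return trace

class FakeEnv:
    """Toy two-screen app used only to demo the loop."""
    def __init__(self):
        self.state = "home"
        self.screens = {
            "home": [UIElement("File", (0, 0, 40, 20)),
                     UIElement("Edit", (50, 0, 90, 20))],
            "file_menu": [UIElement("Open", (0, 20, 40, 40)),
                          UIElement("Save", (0, 40, 40, 60))],
        }
    def observe(self):
        return self.state, self.screens[self.state]
    def click(self, element):
        if element.name == "File":
            self.state = "file_menu"

trace = explore(FakeEnv())
print(json.dumps(trace.records[0]))
```

Running this on the toy environment yields three records ("File" on the home screen, then "Open" and "Save" in the file menu), illustrating how structured interaction traces accumulate without human annotation.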
Problem

Research questions and friction points this paper is trying to address.

Automating GUI data collection for desktop software
Addressing limitations of web-based GUI agent training
Enhancing MLLM adaptation to new software interfaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated GUI exploration with minimal annotation costs
Autonomous parsing and exploration of GUI environments
Multimodal model fine-tuning using collected GUI data
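The last point, fine-tuning an MLLM on collected data, implies converting exploration records into supervised grounding examples. The paper's prompt and target format are not shown on this page; a hedged sketch of one plausible conversion (field names and prompt wording are assumptions) could look like:

```python
# Hypothetical conversion of a collected (screenshot, elements) record into
# instruction-tuning samples for GUI element grounding. The actual format
# used by Auto-Explorer is not specified on this page.

def to_grounding_samples(record):
    samples = []
    for el in record["elements"]:
        samples.append({
            "image": record["state"],  # screenshot reference
            "prompt": f"Locate the UI element named '{el['name']}'.",
            "target": list(el["bbox"]),  # supervision: bounding box
        })
    return samples

record = {
    "state": "home.png",
    "elements": [
        {"name": "File", "bbox": (0, 0, 40, 20)},
        {"name": "Edit", "bbox": (50, 0, 90, 20)},
    ],
}
samples = to_grounding_samples(record)
print(samples[0]["prompt"])  # → Locate the UI element named 'File'.
```

Pairs like these would then serve both as fine-tuning data and, held out, as the GUI element grounding test set mentioned in the abstract.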
Xiangwu Guo
Show Lab, National University of Singapore
Difei Gao
National U. of Singapore; Institute of Computing Technology, Chinese Academy of Sciences
Artificial Intelligence · AI Agent · Vision and Language
Mike Zheng Shou
Show Lab, National University of Singapore