GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration

📅 2025-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing GUI agents exhibit limited generalization across applications and tasks, primarily because current datasets neglect UI structural variations introduced by developers and only cover simple navigation-oriented tasks. This work introduces GUI-Xplore, the first generalization-focused GUI exploration dataset, comprising multi-scenario exploration videos and a five-level structured downstream task taxonomy. Methodologically, we propose the “exploration–reasoning co-design” paradigm—the first systematic framework for modeling cross-application UI structural discrepancies—and establish a hierarchical task taxonomy spanning navigation, interaction, and comprehension. Technically, our approach integrates action-aware GUI modeling, graph-guided environment reasoning, and multi-stage exploration video representation learning. The resulting Xplore-Agent achieves a 10% improvement in task success rate on unseen applications, demonstrating significantly enhanced zero-shot transfer capability.

📝 Abstract
GUI agents hold significant potential to enhance the experience and efficiency of human-device interaction. However, current methods face challenges in generalizing across applications (apps) and tasks, primarily due to two fundamental limitations in existing datasets. First, these datasets overlook developer-induced structural variations among apps, limiting the transferability of knowledge across diverse software environments. Second, many of them focus solely on navigation tasks, which restricts their capacity to represent comprehensive software architectures and complex user interactions. To address these challenges, we introduce GUI-Xplore, a dataset meticulously designed to enhance cross-application and cross-task generalization via an exploration-and-reasoning framework. GUI-Xplore integrates pre-recorded exploration videos providing contextual insights, alongside five hierarchically structured downstream tasks designed to comprehensively evaluate GUI agent capabilities. To fully exploit GUI-Xplore's unique features, we propose Xplore-Agent, a GUI agent framework that combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning. Further experiments indicate that Xplore-Agent achieves a 10% improvement over existing methods in unfamiliar environments, yet there remains significant potential for further enhancement towards truly generalizable GUI agents.
Problem

Research questions and friction points this paper is trying to address.

Generalizing GUI agents across diverse applications and tasks
Existing datasets overlook developer-induced UI structural variations and cover only navigation-oriented tasks
Improving GUI agent performance in unfamiliar (unseen) environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI-Xplore: a generalization-focused dataset pairing pre-recorded exploration videos with five hierarchically structured downstream tasks
Xplore-Agent: combines Action-aware GUI Modeling with Graph-Guided Environment Reasoning
A 10% improvement over existing methods in unfamiliar environments
Yuchen Sun — School of Information Science and Electronic Engineering, Shanghai Jiao Tong University
Shanhui Zhao — Institute for AI Industry Research (AIR), Tsinghua University
Tao Yu — Institute for AI Industry Research (AIR), Tsinghua University
Hao Wen — Institute for AI Industry Research (AIR), Tsinghua University
Samith Va — Shanghai Jiao Tong University
Mengwei Xu — Beijing University of Posts and Telecommunications
Yuanchun Li — Institute for AI Industry Research (AIR), Tsinghua University
Chongyang Zhang — MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University