GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Proprietary real-world GUI environments (e.g., desktop and mobile applications) hinder agent-based navigation research due to limited accessibility and incomplete state observability. Method: This paper introduces (1) an open-source, configurable GUI simulation engine that models screen layouts, icon semantics, and navigation graphs, enabling controllable training and evaluation for complex cross-screen tasks; and (2) a staged reinforcement learning framework integrating supervised fine-tuning, single-turn RL, and multi-turn RL to explicitly model long-horizon navigation policies and environment exploration. Results: Experiments on both static and interactive benchmarks demonstrate significant improvements in screen navigation accuracy and generalization. Moreover, the approach exhibits strong transfer capability to real-world applications, validating its practical efficacy and robustness.
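The summary's core idea — screens as nodes, icons as labeled edges, and full state observability — can be illustrated with a toy environment. This is a hypothetical minimal sketch, not the paper's GUI Exploration Lab engine; the class name `GUINavEnv`, the dict-based graph format, and the sparse goal reward are all assumptions for illustration.

```python
class GUINavEnv:
    """Toy navigation-graph environment (illustrative sketch only, not
    the paper's engine). Screens are nodes, icons are labeled edges,
    and the task is to reach a target screen within a step budget."""

    def __init__(self, nav_graph, start, target, max_steps=8):
        self.nav_graph = nav_graph  # {screen: {icon_name: next_screen}}
        self.start, self.target = start, target
        self.max_steps = max_steps

    def reset(self):
        self.screen, self.steps = self.start, 0
        return self._obs()

    def _obs(self):
        # Full observability: the current screen and its clickable icons.
        return {"screen": self.screen,
                "icons": sorted(self.nav_graph[self.screen])}

    def step(self, icon):
        self.steps += 1
        if icon in self.nav_graph[self.screen]:
            self.screen = self.nav_graph[self.screen][icon]
        reward = 1.0 if self.screen == self.target else 0.0
        done = self.screen == self.target or self.steps >= self.max_steps
        return self._obs(), reward, done

# Usage: a cross-screen task, Home -> Settings -> WiFi.
graph = {
    "Home":     {"settings": "Settings", "camera": "Camera"},
    "Settings": {"wifi": "WiFi", "back": "Home"},
    "Camera":   {"back": "Home"},
    "WiFi":     {"back": "Settings"},
}
env = GUINavEnv(graph, start="Home", target="WiFi")
obs = env.reset()
obs, r, done = env.step("settings")
obs, r, done = env.step("wifi")
assert done and r == 1.0
```

Because the whole navigation graph is defined in code, both training and evaluation can inspect the full environment state — the controllability that proprietary apps lack.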

📝 Abstract
With the rapid development of Large Vision-Language Models, the focus of Graphical User Interface (GUI) agent tasks has shifted from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address it, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios. These results demonstrate the advantages of reinforcement learning in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.
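The abstract's distinction between single-turn and multi-turn RL comes down to what a training sample is: one (screen, click, reward) tuple versus a whole trajectory of clicks collected by trial and error. The sketch below is a hedged illustration of multi-turn trajectory collection, assuming a dict-based navigation graph and a random exploration policy; the graph contents, `rollout` function, and sparse terminal reward are hypothetical, not the paper's setup.

```python
import random

# Hypothetical navigation graph: screens as nodes, icons as labeled edges.
NAV = {
    "Home":     {"settings": "Settings", "mail": "Mail"},
    "Settings": {"display": "Display", "back": "Home"},
    "Mail":     {"back": "Home"},
    "Display":  {"back": "Settings"},
}

def rollout(start, target, max_steps=6, seed=0):
    """Collect one multi-turn trajectory with a random exploration policy."""
    rng = random.Random(seed)
    screen, traj = start, []
    for _ in range(max_steps):
        icon = rng.choice(sorted(NAV[screen]))  # trial-and-error exploration
        nxt = NAV[screen][icon]
        reward = 1.0 if nxt == target else 0.0  # sparse reward at the goal
        traj.append((screen, icon, reward))     # one "turn" in the episode
        screen = nxt
        if reward:
            break
    return traj

traj = rollout("Home", "Display")
```

A multi-turn RL learner would credit every click along a successful trajectory, whereas single-turn RL would score each click in isolation — which is why only the former can learn exploration strategies.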
Problem

Research questions and friction points this paper is trying to address.

Addresses complex GUI navigation challenges in agent tasks
Introduces a simulation environment for comprehensive agent training
Demonstrates that reinforcement learning improves GUI navigation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulation environment engine for GUI navigation research
Multi-turn reinforcement learning for exploration strategies
Supervised fine-tuning as foundation for agent training
Haolong Yan
Beijing University of Posts and Telecommunications
Yeqing Shen
StepFun
Xin Huang
Waseda University
Jia Wang
StepFun
Kaijun Tan
StepFun
Zhixuan Liang
University of Hong Kong
Embodied AI · Machine Learning · Robotics · Computer Vision
Hongxin Li
Institute of Automation, Chinese Academy of Sciences
Zheng Ge
Senior Researcher, StepFun
Multimodal Models · Perception and Reasoning
Osamu Yoshie
Waseda University
Si Li
Beijing University of Posts and Telecommunications
Xiangyu Zhang
StepFun
Daxin Jiang
Co-Founder & CEO, StepFun Corporation
Deep Learning · Foundation Models