GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

📅 2025-10-31
🤖 AI Summary
To address weak cross-domain generalization and inefficient use of interaction history in multimodal large language models (MLLMs) for GUI navigation, this paper proposes a history-aware structured reasoning framework. Training combines supervised fine-tuning on pseudo-labeled trajectories with Group Relative Policy Optimization (GRPO), a reinforcement learning method, so that the three core components (structured reasoning, action prediction, and history summarization) are trained jointly. Key contributions include: (1) a chain-of-thought structure that unifies progress estimation and decision reasoning; (2) a co-optimization mechanism that jointly models action prediction and history summarization; and (3) a history-aware reward that ties summary quality to the performance of subsequent actions. On standard benchmarks, the framework achieves state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios.

📝 Abstract
While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, GUI-Rise, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at https://leon022.github.io/GUI-Rise.
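The per-step flow described in the abstract (reason, then act, then summarize for the next step) can be illustrated with a minimal Python sketch. Everything named here (`StepOutput`, `run_episode`, `toy_policy`, and the string formats) is a hypothetical stand-in for illustration, not the authors' code; a real agent would call an MLLM where `toy_policy` is used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepOutput:
    reasoning: str  # chain-of-thought: progress estimate plus decision reasoning
    action: str     # predicted GUI action for the current screen
    summary: str    # compact history summary handed to the next step


# Hypothetical policy signature: (task, screenshot, previous summary) -> StepOutput
Policy = Callable[[str, str, str], StepOutput]


def run_episode(policy: Policy, task: str, screenshots: List[str]) -> List[StepOutput]:
    """Run one episode, carrying only the compact summary between steps."""
    summary = ""  # no history before the first step
    outputs = []
    for shot in screenshots:
        out = policy(task, shot, summary)
        outputs.append(out)
        summary = out.summary  # the new summary replaces the old one
    return outputs


def toy_policy(task: str, screenshot: str, prev_summary: str) -> StepOutput:
    """Stand-in for the MLLM (assumption): deterministic strings only."""
    steps_done = len(prev_summary.split(";")) if prev_summary else 0
    reasoning = f"{steps_done} step(s) done; next, act on {screenshot}."
    action = f"click({screenshot})"
    summary = f"{prev_summary};{action}" if prev_summary else action
    return StepOutput(reasoning, action, summary)


if __name__ == "__main__":
    for step in run_episode(toy_policy, "open settings", ["home", "menu"]):
        print(step)
```

The design point the sketch tries to capture is that each step receives a single compact summary rather than the full sequence of past screenshots and actions.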
Problem

Research questions and friction points this paper is trying to address.

Improves cross-domain generalization for GUI navigation agents
Enhances history utilization through structured reasoning mechanisms
Addresses limitations in multimodal GUI navigation task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured reasoning generates Chain-of-Thought GUI analyses
History summarization creates compact summaries for future steps
Training combines supervised fine-tuning and reinforcement learning
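As a rough illustration of the reinforcement learning stage, GRPO scores each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization; the function name, the `eps` constant, the use of the population standard deviation, and the example reward values are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Placeholder total rewards for four sampled responses (hypothetical values).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In a setup like the one the paper describes, each reward would combine several terms, including the history-aware one; responses scoring above the group mean receive positive advantages and are reinforced.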
🔎 Similar Papers
Tao Liu
ShanghaiTech University
Chongyu Wang
Florida State University
Rongjie Li
ByteDance
Yingchen Yu
ByteDance, Singapore
Computer Vision
Xuming He
ShanghaiTech University
Bai Song
ByteDance