🤖 AI Summary
To address weak cross-domain generalization and inefficient use of interaction history in multimodal large language models (MLLMs) for GUI navigation, this paper proposes a history-aware structured reasoning framework. Methodologically, it combines supervised fine-tuning on pseudo-labeled trajectories with reinforcement learning via Group Relative Policy Optimization (GRPO), jointly training three core modules: structured reasoning, action prediction, and history summarization. Key contributions include: (1) a chain-of-thought structure unifying progress estimation and decision reasoning; (2) a co-optimization mechanism that jointly models action prediction and history summarization; and (3) a history-aware reward that ties summary quality to the performance of subsequent actions. Evaluated on standard benchmarks, the framework achieves state-of-the-art results under identical training data conditions, with particularly strong performance on out-of-domain tasks. The authors take these results as evidence of robust reasoning and generalization across diverse GUI navigation tasks.
📝 Abstract
While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, GUI-Rise, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). Training employs specialized rewards, including a history-aware objective that directly links summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at https://leon022.github.io/GUI-Rise.
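To make the training signal described above more concrete, the following is a minimal sketch of how a GRPO-style group-relative advantage could be computed from per-sample rewards, together with one possible way to fold a history-aware term into the reward. The group normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO formulation. Everything else is an assumption for illustration only: the function names, the three reward components, and the weights are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(format_ok, action_correct, next_action_correct,
                    w_format=0.1, w_action=1.0, w_history=0.5):
    """Hypothetical reward for one sampled response.

    The history-aware term rewards a summary only if the action predicted
    at the NEXT step, conditioned on that summary, turns out to be correct.
    Components and weights are illustrative assumptions, not the paper's.
    """
    return (w_format * float(format_ok)
            + w_action * float(action_correct)
            + w_history * float(next_action_correct))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward within its own group.

    `rewards` holds the scalar rewards of all responses sampled for the
    same prompt; `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses for the same GUI step (made-up outcomes).
    group = [
        combined_reward(True, True, True),
        combined_reward(True, True, False),
        combined_reward(True, False, False),
        combined_reward(False, False, False),
    ]
    for r, a in zip(group, group_relative_advantages(group)):
        print(f"reward={r:.2f} advantage={a:+.3f}")
```

In this sketch, two responses that pick the same correct action still receive different advantages if only one of them produces a summary that leads to a correct next action, which is the kind of coupling between summary quality and later performance that the abstract describes.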