🤖 AI Summary
To address the problem of historical information loss and suboptimal decision-making in multimodal large language model (MLLM)-driven GUI agents during long-horizon tasks—caused by memory constraints—this paper proposes an *active look-back* framework. The method integrates a dual-level summarization mechanism (observation-level cues + action-level outcomes) with a dedicated retrieval tool that lets the agent recall specific historical screenshots on demand during planning. Trained on a curated step-level instruction dataset of 8.6K samples from mobile GUI navigation trajectories, the resulting PAL-UI-3B/7B models (built on Qwen2.5-VL) achieve significant gains over strong baselines on mobile GUI navigation benchmarks, even in data-efficient settings. Crucially, they also demonstrate strong cross-domain generalization, improving on unseen web navigation tasks without additional training. The core contribution is introducing active memory retrieval for long-horizon planning in GUI agents—balancing the compression efficiency of summaries with high-fidelity, on-demand access to past visual details.
📝 Abstract
Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) promise human-like interaction with software applications, yet long-horizon tasks remain challenging due to memory limitations. Existing approaches either truncate history or rely on simple textual summaries, which risk losing critical information when past visual details become necessary for future decisions. In this paper, we propose **PAL-UI** (**P**lanning with **A**ctive **L**ook-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required. PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool that allows the agent to recall specific historical screenshots during planning. We curate a step-level instruction dataset of 8.6K samples from mobile GUI navigation trajectories and train **PAL-UI-3B** and **PAL-UI-7B** models based on Qwen2.5-VL. Extensive experiments demonstrate that PAL-UI significantly outperforms baseline models and prior methods in mobile GUI navigation tasks, even under data-efficient settings. Moreover, PAL-UI exhibits strong cross-domain generalization, achieving notable improvements in web navigation without additional training. Our work highlights the potential of active memory retrieval for long-horizon planning capabilities of vision-based GUI agents.
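To make the mechanism concrete, here is a minimal sketch of what a dual-level summary memory with an active look-back retrieval tool might look like. This is an illustrative assumption, not the paper's implementation: the class names (`StepRecord`, `LookBackMemory`), method signatures, and summary format are all hypothetical, standing in for the compact context the planner sees each step and the tool call that fetches a full past screenshot.

```python
# Hypothetical sketch of a PAL-UI-style "active look-back" memory.
# All names and structures are illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    step: int
    obs_summary: str      # observation-level cue (what the screen showed)
    action_summary: str   # action-level outcome (what the action achieved)
    screenshot: bytes     # full screenshot kept for on-demand recall

@dataclass
class LookBackMemory:
    records: list = field(default_factory=list)

    def add(self, step, obs_summary, action_summary, screenshot):
        self.records.append(StepRecord(step, obs_summary, action_summary, screenshot))

    def context(self):
        """Compact dual-level summary fed to the planner at every step,
        instead of the full (memory-heavy) screenshot history."""
        return "\n".join(
            f"[step {r.step}] saw: {r.obs_summary} | did: {r.action_summary}"
            for r in self.records
        )

    def retrieve(self, step):
        """Retrieval tool: the agent calls this only when it decides a
        past visual detail is needed, recovering the full screenshot."""
        for r in self.records:
            if r.step == step:
                return r.screenshot
        raise KeyError(f"no record for step {step}")

# Example: the planner normally sees only the text summaries...
mem = LookBackMemory()
mem.add(1, "login page with username field", "entered credentials", b"<png-1>")
mem.add(2, "settings menu", "opened the Wi-Fi panel", b"<png-2>")
print(mem.context())
# ...and actively looks back at a specific screenshot when required.
assert mem.retrieve(1) == b"<png-1>"
```

The point of the sketch is the asymmetry: summaries are cheap and always in context, while full screenshots are expensive and fetched only when the agent explicitly asks for them.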