🤖 AI Summary
To address the problem of historical information loss and suboptimal decision-making in multimodal large language model (MLLM)-driven GUI agents during long-horizon tasks—caused by memory constraints—this paper proposes an *active look-back* framework. The method integrates a dual-level summarization mechanism (observation-level cues + action-level outcomes) with a dedicated retrieval tool that lets the agent recall specific historical screenshots on demand during planning. Trained on a curated step-level instruction dataset of 8.6K samples from mobile GUI navigation trajectories, the resulting PAL-UI-3B/7B models (built on Qwen2.5-VL) achieve significant gains over strong baselines on mobile GUI navigation benchmarks, even in data-efficient settings. Crucially, they also demonstrate strong cross-domain generalization, improving on unseen web navigation tasks without additional training. The core contribution is introducing active memory retrieval for long-horizon planning in GUI agents—balancing the compression efficiency of summaries with high-fidelity, on-demand access to past visual details.
📝 Abstract
Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) promise human-like interaction with software applications, yet long-horizon tasks remain challenging due to memory limitations. Existing approaches either truncate history or rely on simple textual summaries, which risk losing critical information when past visual details become necessary for future decisions. In this paper, we propose **PAL-UI** (**P**lanning with **A**ctive **L**ook-back), a novel framework that enables GUI agents to adaptively retrieve past observations when required. PAL-UI combines a dual-level summarization agent, capturing both observation-level cues and action-level outcomes, with a dedicated retrieval tool that allows the agent to recall specific historical screenshots during planning. We curate a step-level instruction dataset of 8.6K samples from mobile GUI navigation trajectories and train **PAL-UI-3B** and **PAL-UI-7B** models based on Qwen2.5-VL. Extensive experiments demonstrate that PAL-UI significantly outperforms baseline models and prior methods in mobile GUI navigation tasks, even under data-efficient settings. Moreover, PAL-UI exhibits strong cross-domain generalization, achieving notable improvements in web navigation without additional training. Our work highlights the potential of active memory retrieval for long-horizon planning capabilities of vision-based GUI agents.
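To make the mechanism concrete, here is a minimal sketch of what a dual-level summary memory with an active look-back retrieval tool might look like. This is an illustrative assumption, not the paper's implementation: the class names (`StepRecord`, `LookBackMemory`), method signatures, and summary format are all hypothetical, standing in for the compact context the planner sees each step and the tool call that fetches a full past screenshot.

```python
# Hypothetical sketch of a PAL-UI-style "active look-back" memory.
# All names and structures are illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    step: int
    obs_summary: str      # observation-level cue (what the screen showed)
    action_summary: str   # action-level outcome (what the action achieved)
    screenshot: bytes     # full screenshot kept for on-demand recall

@dataclass
class LookBackMemory:
    records: list = field(default_factory=list)

    def add(self, step, obs_summary, action_summary, screenshot):
        self.records.append(StepRecord(step, obs_summary, action_summary, screenshot))

    def context(self):
        """Compact dual-level summary fed to the planner at every step,
        instead of the full (memory-heavy) screenshot history."""
        return "\n".join(
            f"[step {r.step}] saw: {r.obs_summary} | did: {r.action_summary}"
            for r in self.records
        )

    def retrieve(self, step):
        """Retrieval tool: the agent calls this only when it decides a
        past visual detail is needed, recovering the full screenshot."""
        for r in self.records:
            if r.step == step:
                return r.screenshot
        raise KeyError(f"no record for step {step}")

# Example: the planner normally sees only the text summaries...
mem = LookBackMemory()
mem.add(1, "login page with username field", "entered credentials", b"<png-1>")
mem.add(2, "settings menu", "opened the Wi-Fi panel", b"<png-2>")
print(mem.context())
# ...and actively looks back at a specific screenshot when required.
assert mem.retrieve(1) == b"<png-1>"
```

The point of the sketch is the asymmetry: summaries are cheap and always in context, while full screenshots are expensive and fetched only when the agent explicitly asks for them.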