🤖 AI Summary
GUI agents face two core challenges: error accumulation caused by dependence on historical trajectories, and a "decision-first, observation-later" bias arising from local exploration, which causes agents to miss critical interface cues. To address these, the authors propose the Memory-Driven GUI Agent (MGA), the first GUI agent built on an "observe first, then decide" paradigm. MGA constructs context-rich, history-decoupled state representations by jointly encoding the current screenshot, spatial layout features, and a dynamically updated structured memory. Methodologically, it departs from conventional long-chain execution by introducing ternary state modeling (integrating visual, spatial, and memory modalities) to substantially mitigate error propagation and exploration bias. Evaluated on OSWorld, real-world desktop applications (Chrome, VSCode, VLC), and cross-task transfer settings, MGA consistently surpasses state-of-the-art methods, demonstrating superior robustness, generalization, and execution efficiency.
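The ternary state representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: all class and function names (`StructuredMemory`, `EnvState`, `step`) are hypothetical, and the observe/decide callables stand in for the agent's perception and policy modules.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredMemory:
    """Dynamically updated, task-relevant facts (hypothetical schema)."""
    facts: dict = field(default_factory=dict)

    def update(self, key, value):
        self.facts[key] = value

@dataclass
class EnvState:
    """History-decoupled per-step state: the ternary representation."""
    screenshot: bytes          # current screen capture
    spatial_layout: list       # e.g. UI element bounding boxes
    memory: StructuredMemory   # structured memory snapshot

def step(observe, decide, memory):
    """One 'observe first, then decide' step (sketch)."""
    screenshot, layout = observe()            # observe FIRST
    state = EnvState(screenshot, layout, memory)
    action, new_facts = decide(state)         # THEN decide on the fresh state
    for k, v in new_facts.items():
        memory.update(k, v)                   # carry memory, not raw trajectory
    return action
```

Each step depends only on the current state triad; only the compact structured memory persists across steps, which is how this design avoids concatenating full historical trajectories into the context.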
📝 Abstract
The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: (1) dependence on historical trajectories, which amplifies error propagation, and (2) local exploration bias, in which "decision-first, observation-later" mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: the current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on the OSWorld benchmark, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at: {https://anonymous.4open.science/r/MGA-3571}.