🤖 AI Summary
GUI agents face two core challenges: error accumulation caused by dependence on historical trajectories, and a "decision-first, observation-later" bias arising from local exploration, which causes agents to miss critical interface cues. To address these, the authors propose the Memory-Driven GUI Agent (MGA), the first GUI agent built on an "observe first, then decide" paradigm. MGA constructs context-rich, history-decoupled state representations by jointly encoding the current screenshot, spatial layout features, and a dynamically updated structured memory. Methodologically, it departs from conventional long-chain execution by introducing ternary state modeling (integrating visual, spatial, and memory modalities) to substantially mitigate error propagation and exploration bias. Evaluated on OSWorld, real-world desktop applications (Chrome, VSCode, VLC), and cross-task transfer settings, MGA consistently surpasses state-of-the-art methods, demonstrating superior robustness, generalization, and execution efficiency.
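The ternary state representation described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: all class and function names (`StructuredMemory`, `EnvState`, `step`) are hypothetical, and the observe/decide callables stand in for the agent's perception and policy modules.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredMemory:
    """Dynamically updated, task-relevant facts (hypothetical schema)."""
    facts: dict = field(default_factory=dict)

    def update(self, key, value):
        self.facts[key] = value

@dataclass
class EnvState:
    """History-decoupled per-step state: the ternary representation."""
    screenshot: bytes          # current screen capture
    spatial_layout: list       # e.g. UI element bounding boxes
    memory: StructuredMemory   # structured memory snapshot

def step(observe, decide, memory):
    """One 'observe first, then decide' step (sketch)."""
    screenshot, layout = observe()            # observe FIRST
    state = EnvState(screenshot, layout, memory)
    action, new_facts = decide(state)         # THEN decide on the fresh state
    for k, v in new_facts.items():
        memory.update(k, v)                   # carry memory, not raw trajectory
    return action
```

Each step depends only on the current state triad; only the compact structured memory persists across steps, which is how this design avoids concatenating full historical trajectories into the context.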
📝 Abstract
The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: (1) dependence on historical trajectories, which amplifies error propagation, and (2) local exploration bias, in which "decision-first, observation-later" mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: the current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on the OSWorld benchmark, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at: {https://anonymous.4open.science/r/MGA-3571}.