MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
GUI agents face two core challenges: error accumulation caused by dependence on historical trajectories, and a "decision-first, observation-lag" bias stemming from local exploration, which leads to missed critical interface cues. To address these, the authors propose the Memory-Driven GUI Agent (MGA), the first GUI agent to adopt an "observe-then-decide" paradigm. MGA constructs context-rich, history-decoupled state representations by jointly encoding the current screenshot, spatial layout features, and a dynamically updated structured memory. Methodologically, it departs from conventional long-chain execution by introducing ternary state modeling, integrating visual, spatial, and memory modalities, to substantially mitigate error propagation and exploration bias. Evaluated on OSWorld, real-world desktop applications (Chrome, VSCode, VLC), and cross-task transfer settings, MGA consistently surpasses state-of-the-art methods, demonstrating superior robustness, generalization, and execution efficiency.

📝 Abstract
The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: dependence on historical trajectories, which amplifies error propagation, and local exploration bias, where "decision-first, observation-later" mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: the current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on OSWorld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at: {https://anonymous.4open.science/r/MGA-3571}.
Problem

Research questions and friction points this paper is trying to address.

Addresses error propagation from historical trajectory dependence
Overcomes local exploration bias in GUI interaction mechanisms
Enhances robustness and generalization in desktop/web interface navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Observe-first then decide interaction principle
Independent step modeling with triad representation
Dynamic structured memory for context enrichment
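The independent step modeling described above could be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's released implementation: the names (`EnvState`, `update_memory`) and the fixed-size memory policy are assumptions made for exposition.

```python
from dataclasses import dataclass, field

@dataclass
class EnvState:
    """One independent, history-decoupled step: the triad of current
    screenshot, task-agnostic spatial layout, and structured memory."""
    screenshot: bytes                             # raw pixels of the current screen
    spatial_layout: dict[str, tuple[int, int]]    # e.g. widget name -> (x, y) position
    memory: list[str] = field(default_factory=list)  # distilled facts, not a full trajectory

def update_memory(memory: list[str], observation: str, max_entries: int = 5) -> list[str]:
    """Fold the newest distilled observation into memory and cap its size,
    so the state stays compact instead of growing with the trajectory."""
    return (memory + [observation])[-max_entries:]

# Usage: build a fresh state each step from the current observation only.
mem: list[str] = []
for obs in ["opened Chrome", "typed URL", "page loaded"]:
    mem = update_memory(mem, obs)
state = EnvState(screenshot=b"", spatial_layout={"address_bar": (120, 40)}, memory=mem)
```

The key design point the sketch mirrors is that each `EnvState` is rebuilt from the current observation plus a bounded memory, rather than by appending to an ever-growing action history.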
Weihua Cheng
ShanghaiTech University
LLM Agent, Knowledge Tracing
Ersheng Ni
The University of Queensland
Data Mining
Wenlong Wang
Harbin Institute of Technology
Geophysics
Yifei Sun
East China University of Science and Technology, Shanghai, China
Junming Liu
Tongji University, Shanghai, China
Wangyu Shen
ShanghaiTech University, Shanghai, China
Yirong Chen
Stanford University
Botian Shi
Shanghai Artificial Intelligence Laboratory
VLMs, Document Understanding, Autonomous Driving
Ding Wang
Shanghai AI Laboratory, Shanghai, China