🤖 AI Summary
This work addresses the fragility of conventional GUI agents in long-horizon tasks, where re-interpreting the interface at every step degrades performance. The authors propose Executable Agentic Memory (EAM), a structured knowledge graph that shifts task planning from free-form generation to retrieval and execution. EAM is built with a sample-efficient pipeline that uses state-aware depth-first search and action-group mining to compress multi-step routines, and a lightweight Q-function guides Monte Carlo Tree Search over the graph for efficient planning. Theoretical analysis establishes bias-consistency of the Q-model and provides sample complexity bounds for path recovery. Evaluated on AndroidWorld, EAM improves success rate by up to 19.6% over baselines such as UI-TARS-7B, reduces token costs by 6× relative to GPT-4o, and maintains an average latency of 2.8 seconds.
📝 Abstract
Modern GUI agents typically rely on a model-centric, step-wise interaction paradigm in which LLMs must re-interpret the UI and re-decide actions at every screen. This makes them fragile in long-horizon tasks. In this paper, we propose Executable Agentic Memory (EAM), a structured Knowledge Graph (KG) that shifts GUI planning from free-form generation to a robust retrieval-and-execution process. Our approach includes a sample-efficient memory construction pipeline that uses state-aware DFS and action-group mining to compress multi-step routines. To ensure efficient planning, we introduce a value-guided graph search in which a lightweight Q-function model steers Monte Carlo Tree Search (MCTS) over the KG. We theoretically establish bias-consistency for the Q-model and derive sample complexity bounds for path recovery. Empirically, EAM outperforms state-of-the-art baselines such as UI-TARS-7B by up to $19.6\%$ on AndroidWorld, while reducing token costs by $6\times$ relative to GPT-4o. With a $2.8$s average latency, EAM enables reliable, fast, long-horizon GUI automation.
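To make the idea of value-guided search over a KG concrete, here is a minimal Python sketch. It is not the authors' implementation. The toy graph `GRAPH`, the hand-written score table behind `q_prior`, the `plan` function, and all parameter values are hypothetical stand-ins chosen for illustration. The search is also simplified: it uses a PUCT-style selection rule and lets the Q-score replace random rollouts, so it is deterministic.

```python
import math
from collections import defaultdict

# Hypothetical toy knowledge graph: nodes are UI states, edges are actions.
GRAPH = {
    "home": {"open_settings": "settings", "open_contacts": "contacts"},
    "settings": {"tap_wifi": "wifi", "back": "home"},
    "contacts": {"back": "home"},
    "wifi": {},
}

# Hand-written scores standing in for a learned Q-model (assumption).
Q_TABLE = {
    ("home", "open_settings"): 0.6,
    ("home", "open_contacts"): 0.2,
    ("settings", "tap_wifi"): 0.9,
}


def q_prior(state, action):
    """Return the stand-in Q-score for taking `action` in `state`."""
    return Q_TABLE.get((state, action), 0.1)


def plan(graph, q_fn, start, goal, n_sim=200, max_depth=6, c=1.0, gamma=0.9):
    """Simplified Q-guided tree search; returns (action_path, reached_goal)."""
    visits = defaultdict(int)     # visit count per (state, action)
    returns = defaultdict(float)  # summed discounted return per (state, action)

    def simulate(state, depth, on_path):
        if state == goal:
            return 1.0
        # Skip edges that lead back to a state already on this path.
        actions = [a for a, nxt in graph[state].items() if nxt not in on_path]
        if depth == max_depth or not actions:
            return 0.0
        total = sum(visits[(state, a)] for a in actions)

        def score(a):
            n = visits[(state, a)]
            # Unvisited edges fall back to the Q-score as their value estimate.
            mean = returns[(state, a)] / n if n else q_fn(state, a)
            bonus = c * q_fn(state, a) * math.sqrt(total + 1) / (1 + n)
            return mean + bonus

        best = max(actions, key=score)
        nxt = graph[state][best]
        ret = gamma * simulate(nxt, depth + 1, on_path | {nxt})
        visits[(state, best)] += 1
        returns[(state, best)] += ret
        return ret

    for _ in range(n_sim):
        simulate(start, 0, frozenset([start]))

    # Read out the plan by following the most-visited edge from each state.
    path, state = [], start
    while state != goal and graph[state] and len(path) < max_depth:
        best = max(graph[state], key=lambda a: visits[(state, a)])
        path.append(best)
        state = graph[state][best]
    return path, state == goal
```

Calling `plan(GRAPH, q_prior, "home", "wifi")` returns an action path and a flag saying whether the goal state was reached. In this sketch the LLM is not consulted during search at all: planning reduces to graph traversal steered by cheap score lookups, which is the intuition behind trading free-form generation for retrieval and execution.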