MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap in existing mobile GUI agent benchmarks, which largely overlook memory capabilities, particularly the ability to learn across sessions. To this end, the authors propose MemGUI-Bench, the first memory-centric benchmark, comprising 128 tasks that heavily exercise memory through cross-temporal and cross-spatial challenges. They introduce a taxonomy of memory capabilities tailored to mobile GUI agents, a Progressive Scrutiny evaluation mechanism, and a seven-level fine-grained metric suite. An automated evaluation pipeline, MemGUI-Eval, combines LLM-as-judge scoring with pass@k strategies for robust assessment. Experiments on 11 state-of-the-art agents reveal five distinct memory failure modes, leading to five actionable design recommendations. All resources will be open-sourced and actively maintained.

📝 Abstract
Current mobile GUI agent benchmarks systematically fail to assess memory capabilities, with only 5.2-11.8% memory-related tasks and no cross-session learning evaluation. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications where 89.8% challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources including code, benchmark, and evaluation results will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
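The abstract does not spell out how pass@k is computed over repeated task attempts. As a point of reference, the standard unbiased pass@k estimator (widely used when an agent is run n times on a task and c of those runs succeed) can be sketched as follows; the function name and its use here are illustrative, not taken from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    Given n independent attempts at a task, c of which succeeded,
    estimates the probability that at least one of k randomly drawn
    attempts succeeds: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 runs and 3 successes, pass@1 is 1 - C(7,1)/C(10,1) = 0.3, i.e. the empirical success rate; larger k credits agents that succeed at least occasionally.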
Problem

Research questions and friction points this paper is trying to address.

mobile GUI agents
memory evaluation
benchmarking
cross-session learning
memory deficits
Innovation

Methods, ideas, or system contributions that make the work stand out.

memory benchmark
mobile GUI agents
cross-session learning
LLM-as-judge
Progressive Scrutiny