MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap in existing mobile GUI agent benchmarks, which largely overlook memory capabilities, particularly the ability to learn across sessions. To this end, the authors propose MemGUI-Bench, the first memory-centric benchmark, comprising 128 tasks that heavily exercise memory through cross-temporal and cross-spatial challenges. They introduce a taxonomy of memory capabilities tailored to mobile GUI agents, a Progressive Scrutiny evaluation mechanism, and a seven-level fine-grained metric suite. An automated evaluation pipeline, MemGUI-Eval, combines LLM-as-judge scoring with pass@k strategies for robust assessment. Experiments on 11 state-of-the-art agents reveal five distinct memory failure modes, leading to five actionable design recommendations. All resources will be open-sourced and actively maintained.

📝 Abstract
Current mobile GUI agent benchmarks systematically fail to assess memory capabilities, with only 5.2-11.8% memory-related tasks and no cross-session learning evaluation. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications where 89.8% challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources including code, benchmark, and evaluation results will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
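The abstract does not spell out how pass@k is computed over repeated task attempts. As a point of reference, the standard unbiased pass@k estimator (widely used when an agent is run n times on a task and c of those runs succeed) can be sketched as follows; the function name and its use here are illustrative, not taken from the paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    Given n independent attempts at a task, c of which succeeded,
    estimates the probability that at least one of k randomly drawn
    attempts succeeds: 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 runs and 3 successes, pass@1 is 1 - C(7,1)/C(10,1) = 0.3, i.e. the empirical success rate; larger k credits agents that succeed at least occasionally.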
Problem

Research questions and friction points this paper is trying to address.

mobile GUI agents
memory evaluation
benchmarking
cross-session learning
memory deficits
Innovation

Methods, ideas, or system contributions that make the work stand out.

memory benchmark
mobile GUI agents
cross-session learning
LLM-as-judge
Progressive Scrutiny