STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Mobile GUI agents struggle in long-horizon tasks because limited context windows and token-heavy screenshots force them to discard older visual history, losing critical information seen earlier in the task. To overcome this, the authors programmatically inject deterministic memory variables into tasks inside a controllable virtual environment, which gives precise control over what must be memorized, when it is encoded, and when it is retrieved. The environment thereby generates verifiable, memory-annotated supervision data, used in a hybrid training strategy that combines supervised learning with online reinforcement learning. According to the authors, this makes fine-grained control and large-scale annotation of memory behaviors possible while avoiding the cost of annotating static real-world data. The resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models on the newly introduced Memory-World benchmark, with high memory accuracy and task resilience, while maintaining strong general mobile navigation capability.

📝 Abstract

Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments. Deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved. This produces verifiable supervised data at scale and enables online reinforcement learning through environment-driven reward feedback. On our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high-water mark, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.
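
The mechanism described in the abstract can be illustrated with a toy sketch. This is not the STAMP implementation: the `MemoryTask` schema, the string "screenshots", the window size, and the scripted policy are all assumptions made for illustration. It shows a seeded memory variable that is visible at one step and required at a later one, a bounded context that evicts the screen where the value appeared, and a reward that the environment can verify exactly because it injected the value itself.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class MemoryTask:
    """Hypothetical schema for a task with one injected memory variable."""
    value: str          # what must be memorized
    encode_step: int    # step at which the value is visible on screen
    retrieve_step: int  # step at which the value must be produced


def make_task(seed: int, horizon: int = 10) -> MemoryTask:
    # Seeded, so the same seed always yields the same variable and timing.
    rng = random.Random(seed)
    value = f"{rng.randrange(10**6):06d}"
    encode_step = rng.randrange(0, horizon // 2)
    retrieve_step = rng.randrange(horizon // 2 + 3, horizon)
    return MemoryTask(value, encode_step, retrieve_step)


def screenshot(task: MemoryTask, step: int) -> str:
    # Stand-in for a rendered screen: the value appears only at encode_step.
    if step == task.encode_step:
        return f"screen {step}: code={task.value}"
    return f"screen {step}"


def rollout(task: MemoryTask, horizon: int = 10, window: int = 3,
            use_memory: bool = True):
    """Scripted policy (assumption): it reads the task's ground-truth timing,
    which is what lets an environment emit memory-annotated traces."""
    context = deque(maxlen=window)  # limited context: old screens fall out
    notes = {}                      # explicit memory that survives eviction
    answer = None
    for step in range(horizon):
        context.append(screenshot(task, step))
        if use_memory and step == task.encode_step:
            notes["code"] = task.value
        if step == task.retrieve_step:
            visible = [s for s in context if "code=" in s]
            if "code" in notes:
                answer = notes["code"]
            elif visible:
                answer = visible[0].split("code=")[1]
    return answer


def reward(task: MemoryTask, answer) -> float:
    # Verifiable reward: exact match against the injected value.
    return 1.0 if answer == task.value else 0.0


if __name__ == "__main__":
    task = make_task(seed=0)
    print(reward(task, rollout(task, use_memory=True)))
    print(reward(task, rollout(task, use_memory=False)))
```

With the default horizon and window, the gap between encoding and retrieval is always longer than the context window, so the run without explicit memory cannot recover the value from its remaining screens, while the run that wrote a note can.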

Problem

Research questions and friction points this paper is trying to address.

mobile GUI agents

long-horizon tasks

explicit memory

context window limitation

memory training

Innovation

Methods, ideas, or system contributions that make the work stand out.

explicit memory

virtual environments

mobile GUI agents