Mem-W: Latent Memory-Native GUI Agents

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a mismatch in existing GUI agents: memory is stored as external symbolic text or structured records, while the policy acts over a latent embedding sequence, which hinders long-horizon reasoning. It proposes Mem-W, a series of GUI agents with natively latent memory. A shared trajectory-to-latent compressor encodes historical trajectories (experiential memory) and in-session segments (working memory) into compact memory tokens. These tokens are combined with the current interface observation and local context into a single embedding sequence, so memory and decision-making are coordinated end to end. Training uses self-distillation with outcome-aware supervision, treating memory as part of the agent's continuous context. On four web and mobile navigation benchmarks, Mem-W consistently improves diverse backbones and memory-enhanced baselines, with gains of up to +30.0.

📝 Abstract

GUI agents are beginning to operate the web, mobile, and desktop as interactive worlds, where successful control depends on carrying forward visual, procedural, and task-level evidence beyond the fleeting present screen. Yet most agents still treat memory as an external, human-readable artifact: histories are summarized, categorized, retrieved, and reinserted as text or structured records before being encoded again by the policy. This creates a mismatch between the representational form in which experience is stored and the latent embedding sequence over which modern GUI policies actually act. We introduce Mem-W, a series of latent-memory-native GUI agents that treat memory as part of the agent's continuous context rather than as an auxiliary symbolic scaffold. Mem-W weaves both historical trajectories (as experiential memory) and in-session segments (as working memory) into compact memory tokens through a shared trajectory-to-latent compressor. These tokens are woven with the current GUI observation and local context into one continuous embedding sequence, allowing the agent to read successes, failures, and unfinished progress through the same machine-native interface. Mem-W is trained with self-distillation and outcome-aware supervision to preserve decision-relevant state while filtering memory toward evidence that truly supports task success. Across four web and mobile navigation benchmarks, Mem-W consistently improves diverse backbones and memory-enhanced baselines, with gains of up to $+30.0$, suggesting that latent-context-native memory can serve as a scalable foundation for long-horizon GUI agency.
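The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the compressor architecture, so chunked mean pooling stands in for the learned trajectory-to-latent compressor, and the function names (`compress_trajectory`, `build_context`), the token counts, and the ordering of the final sequence are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def compress_trajectory(steps: List[Vector], num_tokens: int) -> List[Vector]:
    """Hypothetical stand-in for a trajectory-to-latent compressor.

    Splits the step embeddings into at most `num_tokens` contiguous chunks
    and mean-pools each chunk into one memory token. A learned module would
    replace this pooling in a real system.
    """
    if not steps or num_tokens <= 0:
        return []
    k = min(num_tokens, len(steps))
    n = len(steps)
    # Integer chunk boundaries; every chunk is non-empty because k <= n.
    bounds = [i * n // k for i in range(k + 1)]
    return [mean_pool(steps[bounds[i]:bounds[i + 1]]) for i in range(k)]


def build_context(
    experiential: List[Vector],
    working: List[Vector],
    observation: List[Vector],
) -> List[Vector]:
    """Concatenate memory tokens and observation embeddings into one sequence.

    The ordering (experiential, working, observation) is an assumption.
    """
    return experiential + working + observation


# Toy 2-dimensional embeddings, made up for illustration only.
past_trajectory = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
in_session_steps = [[0.0, 1.0], [2.0, 3.0]]
current_observation = [[9.0, 9.0], [8.0, 8.0]]

# The same compressor is shared by both memory types.
experiential_tokens = compress_trajectory(past_trajectory, num_tokens=2)
working_tokens = compress_trajectory(in_session_steps, num_tokens=1)

context = build_context(experiential_tokens, working_tokens, current_observation)
```

In this toy run, four past steps and two in-session steps shrink to three memory tokens, which sit in the same sequence as the two observation embeddings. The policy would then read memory and the current screen through one interface, which is the property the abstract emphasizes.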

Problem

Research questions and friction points this paper is trying to address.

GUI agents

latent memory

memory representation

long-horizon tasks

embedding mismatch

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent memory

GUI agents

memory-native architecture