VPWEM: Non-Markovian Visuomotor Policy with Working and Episodic Memory

📅 2026-03-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing visuomotor policies struggle with non-Markovian tasks requiring long-term memory due to their reliance on short-horizon observations, often failing under distribution shifts and incurring high computational costs. This work proposes VPWEM, the first approach to integrate human-inspired working and episodic memory mechanisms into visuomotor policy learning. It introduces a recurrent memory compression architecture that maintains recent observations via a sliding window as working memory and employs a Transformer-based memory compressor to recursively condense historical information into a fixed set of episodic memory tokens. This design effectively integrates task-relevant information across the full episode while maintaining approximately constant computation and memory overhead. VPWEM outperforms state-of-the-art methods by over 20% on the memory-intensive MIKASA manipulation benchmark and achieves an average improvement of 5% on the MoMaRT mobile manipulation benchmark.

📝 Abstract
Imitation learning from human demonstrations has achieved significant success in robotic control, yet most visuomotor policies still condition on single-step observations or short-context histories, making them struggle with non-Markovian tasks that require long-term memory. Simply enlarging the context window incurs substantial computational and memory costs and encourages overfitting to spurious correlations, leading to catastrophic failures under distribution shift and violating real-time constraints in robotic systems. By contrast, humans can compress important past experiences into long-term memories and exploit them to solve tasks throughout their lifetime. In this paper, we propose VPWEM, a non-Markovian visuomotor policy equipped with working and episodic memories. VPWEM retains a sliding window of recent observation tokens as short-term working memory, and introduces a Transformer-based contextual memory compressor that recursively converts out-of-window observations into a fixed number of episodic memory tokens. The compressor uses self-attention over a cache of past summary tokens and cross-attention over a cache of historical observations, and is trained jointly with the policy. We instantiate VPWEM on diffusion policies to exploit both short-term and episode-wide information for action generation with nearly constant memory and computation per step. Experiments demonstrate that VPWEM outperforms state-of-the-art baselines including diffusion policies and vision-language-action (VLA) models by more than 20% on the memory-intensive manipulation tasks in MIKASA and achieves an average 5% improvement on the mobile manipulation benchmark MoMaRT. Code is available at https://github.com/HarryLui98/code_vpwem.
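The memory mechanism described above can be illustrated with a minimal numpy sketch, assuming a toy single-head attention update: a sliding window of recent observation tokens serves as working memory, and whenever a token falls out of the window it is folded into a fixed number of episodic memory tokens by attending over the current summaries plus the evicted observation. This is not the authors' implementation — the paper's compressor is a jointly trained Transformer with caches of past summary and observation tokens — and all names here (`MemoryCompressorSketch`, `observe`, `context`) are illustrative.

```python
import numpy as np
from collections import deque

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MemoryCompressorSketch:
    """Toy analogue of working + episodic memory: a bounded window of
    recent tokens, plus n_mem fixed summary tokens updated by one
    attention step per evicted observation (untrained, for shape only)."""

    def __init__(self, window=4, n_mem=2, d=8, seed=0):
        rng = np.random.default_rng(seed)
        self.window = deque(maxlen=window)       # working memory (recent obs)
        self.mem = rng.normal(size=(n_mem, d))   # episodic memory tokens
        self.d = d

    def observe(self, token):
        # When the window is full, compress the oldest token into
        # episodic memory before it is discarded.
        if len(self.window) == self.window.maxlen:
            self._compress(self.window[0])
        self.window.append(token)

    def _compress(self, obs):
        # Memory tokens attend over themselves (self-attention analogue)
        # and the evicted observation (cross-attention analogue), so the
        # summary stays a fixed size regardless of episode length.
        kv = np.vstack([self.mem, obs[None, :]])
        att = softmax(self.mem @ kv.T / np.sqrt(self.d), axis=-1)
        self.mem = att @ kv

    def context(self):
        # Concatenated conditioning input: episodic + working memory.
        return np.vstack([self.mem] + list(self.window))

m = MemoryCompressorSketch(window=4, n_mem=2, d=8)
rng = np.random.default_rng(1)
for _ in range(10):
    m.observe(rng.normal(size=8))
print(m.context().shape)  # (2 memory + 4 window) tokens of dim 8 → (6, 8)
```

In the paper's setting, the output of something like `context()` would condition a diffusion policy's action generation; because the window and the episodic token set are both fixed-size, per-step compute and memory stay roughly constant no matter how long the episode runs.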
Problem

Research questions and friction points this paper is trying to address.

non-Markovian tasks
long-term memory
visuomotor policy
distribution shift
real-time constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

non-Markovian policy
working memory
episodic memory
memory compression
diffusion policy