MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

📅 2025-08-26

📈 Citations: 0

✨ Influential: 0

career value

226K/year

🤖 AI Summary

To address the limitations of mainstream vision-language-action (VLA) models in long-horizon robotic manipulation—specifically their neglect of temporal context and inability to model non-Markovian dependencies—this paper proposes a VLA framework integrating perceptual and cognitive memory. Inspired by cognitive science, we design a dual-path memory architecture: a working memory that dynamically caches low-level perceptual tokens, and a long-term memory that stores high-level semantic representations; these jointly enable dynamic retrieval and fusion of temporal information. Leveraging a pre-trained vision-language model (VLM), we generate multi-granularity tokens and condition a diffusion-based action model on memory states to produce robust action sequences. Evaluated across three simulation benchmarks, our method achieves an average success rate of 73.2%; on 12 real-world long-horizon tasks, it attains 84.0% success—outperforming the state-of-the-art by 26.0%.

Technology Category

Application Category

📝 Abstract

Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA

Problem

Research questions and friction points this paper is trying to address.

Addressing non-Markovian robotic manipulation tasks with temporal dependencies

Overcoming mainstream VLA models' limitation in handling long-horizon tasks

Integrating working memory and long-term memory mechanisms for robotic control

Innovation

Methods, ideas, or system contributions that make the work stand out.

Perceptual-Cognitive Memory Bank for storing details

Working memory retrieves and fuses relevant entries

Memory-conditioned diffusion expert yields action sequences

🔎 Similar Papers

Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance