MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

📅 2025-08-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limitations of mainstream vision-language-action (VLA) models in long-horizon robotic manipulation, namely their neglect of temporal context and inability to model non-Markovian dependencies, this paper proposes a VLA framework integrating perceptual and cognitive memory. Inspired by cognitive science, the authors design a dual-path memory architecture: a working memory that dynamically caches low-level perceptual tokens, and a long-term memory that stores high-level semantic representations; together these enable dynamic retrieval and fusion of temporal information. Leveraging a pretrained vision-language model (VLM), the framework generates multi-granularity tokens and conditions a diffusion-based action model on memory states to produce robust action sequences. Evaluated across three simulation benchmarks, the method achieves an average success rate of 73.2%; on 12 real-world tasks spanning general skills and long-horizon temporal dependencies, it attains 84.0% success, with long-horizon tasks improving by 26 points over the state-of-the-art baseline.
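To make the dual-path design concrete, the following is a minimal PyTorch-style sketch of the retrieval-and-fusion step: current working-memory tokens query the memory bank via cross-attention, and a learned gate blends the retrieved context back into the current tokens. The module name, gating rule, and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MemoryRetrievalFusion(nn.Module):
    """Hypothetical sketch: working-memory tokens retrieve decision-relevant
    entries from a memory bank via cross-attention, then fuse them adaptively
    with the current tokens through a learned gate."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor, bank: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) current perceptual/cognitive tokens (working memory)
        # bank:   (B, M, D) stored entries from the long-term memory bank
        retrieved, _ = self.cross_attn(query=tokens, key=bank, value=bank)
        g = self.gate(torch.cat([tokens, retrieved], dim=-1))  # per-token blend weight
        return g * retrieved + (1 - g) * tokens
```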

📝 Abstract
Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and the semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On the SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming the state-of-the-art baselines CogACT and pi-0, with a notable +14.6-point gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves an 84.0% success rate, with long-horizon tasks showing a +26-point improvement over the state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA
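The abstract's "updates the bank by merging redundancies" step can be illustrated with a simple similarity test: a candidate entry is averaged into its nearest neighbor when the two are near-duplicates, and appended otherwise. The cosine threshold and the averaging rule below are assumptions for illustration only, not the paper's exact consolidation mechanism.

```python
import torch
import torch.nn.functional as F

def consolidate_bank(bank: torch.Tensor, entry: torch.Tensor,
                     threshold: float = 0.9) -> torch.Tensor:
    """Hypothetical consolidation: merge a new working-memory entry into the
    bank if it is redundant with an existing one, otherwise append it.
    bank: (M, D) stored entries; entry: (D,) candidate representation."""
    sims = F.cosine_similarity(bank, entry.unsqueeze(0), dim=-1)  # (M,)
    i = int(sims.argmax())
    if sims[i] > threshold:
        bank = bank.clone()
        bank[i] = 0.5 * (bank[i] + entry)  # merge near-duplicates (assumed rule)
        return bank
    return torch.cat([bank, entry.unsqueeze(0)], dim=0)  # store a novel entry
```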
Problem

Research questions and friction points this paper is trying to address.

Addressing non-Markovian robotic manipulation tasks with temporal dependencies
Overcoming mainstream VLA models' limitation in handling long-horizon tasks
Integrating working memory and long-term memory mechanisms for robotic control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perceptual-Cognitive Memory Bank that stores low-level perceptual details alongside high-level semantic gist
Working memory that retrieves decision-relevant entries from the bank and adaptively fuses them with current tokens
Memory-conditioned diffusion action expert that yields temporally aware action sequences (a sketch follows this list)
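Below is a minimal, hypothetical sketch of the third point in the spirit of diffusion policies: a denoiser that predicts the noise on a chunk of future actions, conditioned on the memory-fused tokens. The cross-attention conditioning pathway and all names are assumptions, chosen because they are common in diffusion-based action heads, not the paper's confirmed design.

```python
import torch
import torch.nn as nn

class MemoryConditionedDenoiser(nn.Module):
    """Hypothetical diffusion action expert: predicts the noise on a noisy
    action chunk, conditioned on memory-fused tokens via cross-attention."""
    def __init__(self, action_dim: int, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.action_in = nn.Linear(action_dim, dim)
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, action_dim)

    def forward(self, noisy_actions, t, memory_tokens):
        # noisy_actions: (B, H, A) action chunk with noise at diffusion step t
        # t:             (B, 1)    normalized diffusion timestep
        # memory_tokens: (B, N, D) temporally fused perceptual/cognitive tokens
        h = self.action_in(noisy_actions) + self.time_mlp(t).unsqueeze(1)
        h, _ = self.cross_attn(query=h, key=memory_tokens, value=memory_tokens)
        return self.out(h)  # predicted noise for each action in the chunk
```

At inference, one would start from Gaussian noise over the action chunk and denoise iteratively with a standard DDPM/DDIM schedule, feeding the current memory-fused tokens at every step.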
👥 Authors
Hao Shi
Department of Automation, BNRist, Tsinghua University
Bin Xie
InfoBeyond Technology LLC
Mobile Computing · Security · Big Data Streaming
Yingfei Liu
Megvii Technology
Lin Sun
Qihoo 360
Large Language Models
Fengrong Liu
Harbin Institute of Technology
Tiancai Wang
Dexmal
Computer Vision · Embodied AI
Erjin Zhou
Megvii Inc.
Computer Vision
Haoqiang Fan
Megvii
Computer Vision
Xiangyu Zhang
MEGVII Technology, StepFun
Gao Huang
Department of Automation, BNRist, Tsinghua University