MEM: Multi-Scale Embodied Memory for Vision Language Action Models

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of jointly modeling long-term semantic memory and short-term perceptual memory in end-to-end robot learning for complex, multi-stage tasks. The authors propose a multi-scale embodied memory architecture that, for the first time, integrates multimodal and multi-granularity memory mechanisms into robotic policy learning. Specifically, a video encoder compresses short-term visual memory to handle occlusions, while a language model processes long-term semantic memory represented in textual form. These components are unified within a vision–language–action policy framework. The approach successfully executes long-horizon tasks—such as kitchen cleaning and sandwich preparation—lasting up to fifteen minutes, demonstrating the ability to adapt manipulation strategies based on contextual cues.

📝 Abstract
Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot remembering the object it wants to pick up once its arm occludes it). In this work, our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, such as cleaning up a kitchen or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.
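The two-stream memory described above can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: the class and method names are invented, a rolling average stands in for the video-encoder compression of short-term frames, and a plain string list stands in for the language-model-processed long-term semantic memory.

```python
# Hypothetical sketch of a MEM-style multi-scale memory.
# All names are illustrative; they do not come from the paper.
from collections import deque

class MultiScaleMemory:
    """Combines short-term perceptual memory with long-term semantic memory."""

    def __init__(self, short_horizon=8):
        # Short-term memory: a rolling window of recent frame features,
        # standing in for the paper's video-encoder compression.
        self.frames = deque(maxlen=short_horizon)
        # Long-term memory: completed task stages kept as text,
        # standing in for the text-based semantic memory.
        self.stages_done = []

    def observe(self, frame_feature):
        self.frames.append(frame_feature)  # oldest frame evicted when full

    def complete_stage(self, stage):
        self.stages_done.append(stage)

    def policy_context(self):
        # A real VLA policy would fuse encoder tokens; here we just
        # package both memory streams for a downstream policy.
        compressed = [sum(f) / len(f) for f in self.frames]  # toy "compression"
        summary = "; ".join(self.stages_done) or "no stages done"
        return {"short_term": compressed, "long_term": summary}

mem = MultiScaleMemory(short_horizon=2)
mem.observe([0.0, 1.0])
mem.observe([1.0, 1.0])
mem.observe([2.0, 4.0])          # evicts the first frame
mem.complete_stage("bread toasted")
ctx = mem.policy_context()
print(ctx["short_term"])         # [1.0, 3.0]
print(ctx["long_term"])          # bread toasted
```

The key design point mirrored here is the asymmetry of the two streams: the short-term store is bounded and lossy (recent perception, occlusion-robust), while the long-term store is compact text that can grow over the full fifteen-minute task.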
Problem

Research questions and friction points this paper is trying to address.

multi-scale memory
vision-language-action models
long-horizon robotic tasks
embodied memory
memory granularity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Scale Memory
Vision-Language Action Models
Embodied AI
Long-Horizon Robotic Control
Mixed-Modal Memory