MEM: Multi-Scale Embodied Memory for Vision Language Action Models

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of jointly modeling long-term semantic memory and short-term perceptual memory in end-to-end robot learning for complex, multi-stage tasks. The authors propose a multi-scale embodied memory architecture that, for the first time, integrates multimodal and multi-granularity memory mechanisms into robotic policy learning. Specifically, a video encoder compresses short-term visual memory to handle occlusions, while a language model processes long-term semantic memory represented in textual form. These components are unified within a vision–language–action policy framework. The approach successfully executes long-horizon tasks—such as kitchen cleaning and sandwich preparation—lasting up to fifteen minutes, demonstrating the ability to adapt manipulation strategies based on contextual cues.

📝 Abstract
Conventionally, memory in end-to-end robotic learning involves inputting a sequence of past observations into the learned policy. However, in complex multi-stage real-world tasks, the robot's memory must represent past events at multiple levels of granularity: from long-term memory that captures abstracted semantic concepts (e.g., a robot cooking dinner should remember which stages of the recipe are already done) to short-term memory that captures recent events and compensates for occlusions (e.g., a robot remembering the object it wants to pick up once its arm occludes it). In this work, our main insight is that an effective memory architecture for long-horizon robotic control should combine multiple modalities to capture these different levels of abstraction. We introduce Multi-Scale Embodied Memory (MEM), an approach for mixed-modal long-horizon memory in robot policies. MEM combines video-based short-horizon memory, compressed via a video encoder, with text-based long-horizon memory. Together, they enable robot policies to perform tasks that span up to fifteen minutes, such as cleaning up a kitchen or preparing a grilled cheese sandwich. Additionally, we find that memory enables MEM policies to intelligently adapt manipulation strategies in-context.
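The two-stream memory described above can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: the class and method names are invented, a rolling average stands in for the video-encoder compression of short-term frames, and a plain string list stands in for the language-model-processed long-term semantic memory.

```python
# Hypothetical sketch of a MEM-style multi-scale memory.
# All names are illustrative; they do not come from the paper.
from collections import deque

class MultiScaleMemory:
    """Combines short-term perceptual memory with long-term semantic memory."""

    def __init__(self, short_horizon=8):
        # Short-term memory: a rolling window of recent frame features,
        # standing in for the paper's video-encoder compression.
        self.frames = deque(maxlen=short_horizon)
        # Long-term memory: completed task stages kept as text,
        # standing in for the text-based semantic memory.
        self.stages_done = []

    def observe(self, frame_feature):
        self.frames.append(frame_feature)  # oldest frame evicted when full

    def complete_stage(self, stage):
        self.stages_done.append(stage)

    def policy_context(self):
        # A real VLA policy would fuse encoder tokens; here we just
        # package both memory streams for a downstream policy.
        compressed = [sum(f) / len(f) for f in self.frames]  # toy "compression"
        summary = "; ".join(self.stages_done) or "no stages done"
        return {"short_term": compressed, "long_term": summary}

mem = MultiScaleMemory(short_horizon=2)
mem.observe([0.0, 1.0])
mem.observe([1.0, 1.0])
mem.observe([2.0, 4.0])          # evicts the first frame
mem.complete_stage("bread toasted")
ctx = mem.policy_context()
print(ctx["short_term"])         # [1.0, 3.0]
print(ctx["long_term"])          # bread toasted
```

The key design point mirrored here is the asymmetry of the two streams: the short-term store is bounded and lossy (recent perception, occlusion-robust), while the long-term store is compact text that can grow over the full fifteen-minute task.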
Problem

Research questions and friction points this paper is trying to address.

multi-scale memory
vision-language-action models
long-horizon robotic tasks
embodied memory
memory granularity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Scale Memory
Vision-Language Action Models
Embodied AI
Long-Horizon Robotic Control
Mixed-Modal Memory