MAP-VLA: Memory-Augmented Prompting for Vision-Language-Action Model in Robotic Manipulation

📅 2025-11-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language-action (VLA) models lack explicit memory mechanisms, limiting their capability in long-horizon robotic manipulation tasks. To address this, we propose a plug-and-play memory-augmented prompting framework: (1) a soft-prompt-based memory bank explicitly encodes task-stage knowledge from historical demonstrations; and (2) a trajectory-similarity-driven dynamic retrieval mechanism matches and fuses relevant historical information in real time, without fine-tuning the frozen pre-trained VLA backbone. Our approach preserves model integrity while significantly enhancing long-horizon task continuity and cross-task generalization. In simulation it improves success rates by up to 7.0% (absolute); on real robots, success rates increase by up to 25.0%, surpassing state-of-the-art methods. The framework is computationally lightweight, fully compatible with existing VLA models, and requires no architectural modification or parameter updates to the core model.

📝 Abstract
Pre-trained Vision-Language-Action (VLA) models have achieved remarkable success in improving robustness and generalization for end-to-end robotic manipulation. However, these models struggle with long-horizon tasks due to their lack of memory and reliance solely on immediate sensory inputs. To address this limitation, we propose Memory-Augmented Prompting for Vision-Language-Action model (MAP-VLA), a novel framework that empowers pre-trained VLA models with demonstration-derived memory prompts to augment action generation for long-horizon robotic manipulation tasks. To achieve this, MAP-VLA first constructs a memory library from historical demonstrations, where each memory unit captures information about a specific stage of a task. These memory units are implemented as learnable soft prompts optimized through prompt tuning. Then, during real-time task execution, MAP-VLA retrieves relevant memory through trajectory similarity matching and dynamically integrates it into the VLA model for augmented action generation. Importantly, this prompt tuning and retrieval augmentation approach operates as a plug-and-play module for a frozen VLA model, offering a lightweight and flexible solution to improve task performance. Experimental results show that MAP-VLA delivers up to 7.0% absolute performance gains in the simulation benchmark and 25.0% on real robot evaluations for long-horizon tasks, surpassing the current state-of-the-art methods.
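The memory library and retrieval step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `MemoryLibrary` class, the use of mean-pooled trajectory embeddings as keys, and plain cosine similarity for matching are all assumptions made for clarity.

```python
import numpy as np

# Hypothetical memory library: each memory unit pairs a reference
# trajectory embedding (the key, e.g. a pooled encoding of a task
# stage from a demonstration) with a learned soft prompt (a matrix
# of prompt-token embeddings produced by prompt tuning).
class MemoryLibrary:
    def __init__(self):
        self.keys = []     # one trajectory embedding per task stage
        self.prompts = []  # learned soft prompts (num_tokens x dim)

    def add(self, traj_embedding, soft_prompt):
        self.keys.append(np.asarray(traj_embedding, dtype=float))
        self.prompts.append(np.asarray(soft_prompt, dtype=float))

    def retrieve(self, current_traj):
        # Cosine similarity between the live trajectory embedding
        # and every stored key; return the best-matching soft prompt
        # together with its similarity score.
        q = np.asarray(current_traj, dtype=float)
        q = q / (np.linalg.norm(q) + 1e-8)
        sims = [float(q @ (k / (np.linalg.norm(k) + 1e-8)))
                for k in self.keys]
        best = int(np.argmax(sims))
        return self.prompts[best], sims[best]
```

At execution time the robot's partial trajectory is embedded, matched against the keys, and the winning soft prompt is handed to the frozen VLA model for the next action-generation step.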
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLA models for long-horizon robotic manipulation tasks
Addressing memory limitations in pre-trained vision-language-action models
Improving action generation using demonstration-derived memory prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-augmented prompting enhances VLA models
Retrieves relevant memory via trajectory similarity matching
Plug-and-play module enables lightweight task improvement
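The plug-and-play integration amounts to injecting the retrieved soft prompt into the frozen model's input sequence rather than updating any weights. A minimal sketch, assuming the common prompt-tuning convention of prepending prompt tokens to the input token embeddings (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def fuse_prompt(input_embeddings, soft_prompt):
    # Prepend the retrieved soft-prompt tokens to the VLA model's
    # input token embeddings. The backbone consumes the longer
    # sequence unchanged, so its weights stay frozen; only the
    # soft-prompt parameters were trained.
    return np.concatenate([soft_prompt, input_embeddings], axis=0)
```

Because the fusion happens purely at the input level, the same mechanism can wrap any pre-trained VLA model without architectural changes.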
🔎 Similar Papers
2024-04-02 · IEEE/RSJ International Conference on Intelligent Robots and Systems · Citations: 0
👥 Authors
Runhao Li — Nanyang Technological University, Singapore
Wenkai Guo — Nanyang Technological University, Singapore
Zhenyu Wu — Beijing University of Posts and Telecommunications, Beijing, China
Changyuan Wang — Tsinghua University, Beijing, China
Haoyuan Deng — Nanyang Technological University (Robotics, Imitation Learning, Reinforcement Learning)
Zhenyu Weng — South China University of Technology, Guangzhou, China
Yap-Peng Tan — VinUniversity, Hanoi, Vietnam
Ziwei Wang — Nanyang Technological University, Singapore