🤖 AI Summary
Existing vision-language-action (VLA) models struggle to generalize to unseen tasks because experience transfers poorly across objects, scenes, and action patterns. This work proposes VLA-Pro, a plug-and-play framework that adds a procedural memory mechanism to VLA models. During training, task-specific LoRA adapters are stored as parameterized procedural memories; at inference, the model retrieves the memories relevant to the current multimodal context and dynamically fuses them to generate the current action chunk. This design enables modular cross-task experience transfer. Evaluated on RoboTwin, RLBench, and real-world manipulation tasks across multiple backbones, VLA-Pro achieves up to a 207% relative improvement in cross-task generalization in simulation and raises the real-world success rate from 5.8% to 65.0%.
📝 Abstract
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that require transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses them to generate the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.
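The retrieve-then-fuse step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how memories are keyed, scored, or combined, so everything below is an assumption made for illustration: each memory is taken to be a context embedding plus one LoRA pair (A, B), retrieval is taken to be top-k cosine similarity, and fusion is taken to be a softmax-weighted sum of the low-rank updates B·A. The names `ProceduralMemory`, `retrieve`, and `fuse`, and the `k` and `temperature` parameters, are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class ProceduralMemory:
    """Hypothetical memory entry: a retrieval key plus one LoRA adapter."""
    key: list  # context embedding associated with the training task
    A: list    # LoRA down-projection, shape r x d_in
    B: list    # LoRA up-projection, shape d_out x r


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def matmul(X, Y):
    """Plain nested-list matrix product X @ Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, bank, k):
    """Return the k (score, memory) pairs most similar to the query (assumed rule)."""
    scored = [(cosine(query, mem.key), mem) for mem in bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def fuse(query, bank, k=2, temperature=0.1):
    """Fuse the retrieved adapters into one weight update (assumed fusion rule)."""
    top = retrieve(query, bank, k)
    weights = softmax([score for score, _ in top], temperature)
    d_out, d_in = len(top[0][1].B), len(top[0][1].A[0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for w, (_, mem) in zip(weights, top):
        update = matmul(mem.B, mem.A)  # low-rank update of this memory
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += w * update[i][j]
    return delta, weights


# Toy bank with two rank-1 adapters; the query matches the first key exactly.
bank = [
    ProceduralMemory(key=[1.0, 0.0], A=[[1.0, 0.0]], B=[[1.0], [0.0]]),
    ProceduralMemory(key=[0.0, 1.0], A=[[0.0, 1.0]], B=[[0.0], [1.0]]),
]
delta, weights = fuse([1.0, 0.0], bank, k=2, temperature=0.1)
```

In this toy case nearly all of the fusion weight falls on the first memory, so `delta` is dominated by that adapter's update. A real system would presumably apply the fused update to the backbone's weights once per action chunk, but that integration is outside what the abstract describes.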