PrimitiveVLA: Learning Reusable Motion Primitives for Efficient and Generalizable Robotic Manipulation

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models suffer from low data efficiency and limited generalization because they map instructions directly to control signals. This work proposes PrimitiveVLA, a primitive-centric "Disassemble & Assemble" framework. During fine-tuning, an automated pipeline disassembles demonstration trajectories into reusable motion primitives. A shared Multimodal Canonical Representation (MCR) unifies this phase with inference-phase assembly, in which a vision-language model (VLM) based planner and a switch module generated by a large language model (LLM) execute the primitives in a closed loop. The authors report improved data efficiency and stronger zero-shot generalization on both unseen tasks and long-horizon tasks.
📝 Abstract
Vision-Language-Action (VLA) models offer a promising paradigm for generalist robotic policies, yet their adaptation is hindered by data inefficiency and poor generalization. We argue that these bottlenecks stem from the prevailing Direct Instruction-to-Control Mapping, which forces models to memorize monolithic trajectories rather than reusable motion patterns, i.e., primitives. We propose PrimitiveVLA, a framework that replaces this mapping with a Primitive-Centric Disassemble & Assemble paradigm. Supported by a shared Multimodal Canonical Representation (MCR), PrimitiveVLA unifies two phases: (1) Fine-tuning-phase Disassembly, which uses an automated pipeline to disassemble demonstrations into reusable primitives; and (2) Inference-phase Assembly, which employs a VLM-based planner and an LLM-generated switch module for robust closed-loop execution. By disassembling tasks into reusable primitives, PrimitiveVLA enables VLA models to learn invariant motion patterns instead of task-specific trajectories. Extensive experiments show that our framework improves data efficiency and achieves superior zero-shot generalization across unseen and long-horizon tasks.
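To make the Inference-phase Assembly idea more concrete, the following is a minimal toy sketch of a closed loop that runs a sequence of primitives and switches between them. It is not the paper's implementation. Every name in it (`Primitive`, `run_closed_loop`, the `reach` and `grasp` primitives, the `dist` and `grip` observation keys) is a hypothetical illustration, the fixed `plan` list merely stands in for the output of a VLM-based planner, and the `done` predicates stand in for the LLM-generated switch module.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical observation type: a flat dict of named scalar readings.
Observation = Dict[str, float]


@dataclass
class Primitive:
    """A reusable motion primitive (illustrative, not the paper's definition)."""
    name: str
    policy: Callable[[Observation], Observation]  # one control step -> next observation
    done: Callable[[Observation], bool]           # stand-in for the switch condition


def run_closed_loop(plan: List[str], library: Dict[str, Primitive],
                    obs: Observation, max_steps: int = 100) -> List[str]:
    """Run primitives in plan order, re-checking the switch condition every step."""
    trace: List[str] = []
    for name in plan:
        prim = library[name]
        steps = 0
        while not prim.done(obs):          # closed loop: observe before each action
            if steps >= max_steps:
                raise RuntimeError(f"primitive {name!r} did not finish")
            obs = prim.policy(obs)
            trace.append(name)
            steps += 1
    return trace


# Toy 1-D world: move toward an object, then close the gripper.
library = {
    "reach": Primitive("reach",
                       policy=lambda o: {**o, "dist": o["dist"] - 1.0},
                       done=lambda o: o["dist"] <= 0.0),
    "grasp": Primitive("grasp",
                       policy=lambda o: {**o, "grip": 1.0},
                       done=lambda o: o["grip"] >= 1.0),
}

plan = ["reach", "grasp"]  # placeholder for a planner's output
trace = run_closed_loop(plan, library, {"dist": 3.0, "grip": 0.0})
print(trace)
```

In this toy setting the same `reach` and `grasp` primitives could be reordered or reused in a different `plan` without retraining anything, which is the kind of reuse the abstract attributes to primitive-level learning.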
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
data inefficiency
generalization
motion primitives
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

motion primitives
Vision-Language-Action models
data efficiency
zero-shot generalization
Multimodal Canonical Representation
Yutai Li
State Key Lab of Processors, Institute of Computing Technology, CAS; Jiangsu Key Laboratory of AI for Industries, Institute of AI for Industries, CAS; University of Chinese Academy of Sciences; Cambricon Technologies
Shaohui Peng
Institute of Software, Chinese Academy of Sciences
Embodied AI, Reinforcement Learning
Jiaming Guo
Institute of Computing Technology, Chinese Academy of Sciences
Artificial intelligence, Reinforcement Learning
Di Huang
ICT, CAS
Zihao Zhang
Tianjin University
Computer Vision
Yuxuan Guo
State Key Lab of Processors, Institute of Computing Technology, CAS; University of Science and Technology of China
Yunkai Gao
Jiangsu Key Laboratory of AI for Industries, Institute of AI for Industries, CAS
Siming Lan
Jiangsu Key Laboratory of AI for Industries, Institute of AI for Industries, CAS
Ling Li
Intelligent Software Research Center, Institute of Software, CAS
Xing Hu
Institute of Computing Technology, Chinese Academy of Sciences
micro-architecture, Deep learning architecture
Yunji Chen
Institute of Computing Technology, Chinese Academy of Sciences
processor architecture, microarchitecture, machine learning