🤖 AI Summary
Existing vision-language-action (VLA) models suffer from low data efficiency and limited generalization because they map instructions directly to control signals, which pushes them to memorize whole trajectories. This work proposes PrimitiveVLA, a primitive-centric "disassemble-and-assemble" framework. During fine-tuning, an automated pipeline disassembles demonstrations into reusable motion primitives. A shared Multimodal Canonical Representation (MCR) unifies primitive learning and composition. At inference, a vision-language model (VLM) planner and a switch module generated by a large language model (LLM) assemble the primitives in a closed loop. The authors report improved data efficiency and stronger zero-shot generalization on both unseen tasks and long-horizon tasks.
📝 Abstract
Vision-Language-Action (VLA) models offer a promising paradigm for generalist robotic policies, yet their adaptation is hindered by data inefficiency and poor generalization. We argue that these bottlenecks stem from the prevailing Direct Instruction-to-Control Mapping, which forces models to memorize monolithic trajectories rather than reusable motion patterns, i.e., primitives. We propose PrimitiveVLA, a framework that replaces this mapping with a Primitive-Centric Disassemble & Assemble paradigm. Supported by a shared Multimodal Canonical Representation (MCR), PrimitiveVLA unifies two phases: (1) Fine-tuning-phase Disassembly, which uses an automated pipeline to disassemble demonstrations into reusable primitives; and (2) Inference-phase Assembly, which employs a VLM-based planner and an LLM-generated switch module for robust closed-loop execution. By disassembling tasks into reusable primitives, PrimitiveVLA enables VLA models to learn invariant motion patterns instead of task-specific trajectories. Extensive experiments show that our framework improves data efficiency and achieves superior zero-shot generalization on both unseen and long-horizon tasks.
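The two phases can be illustrated with a toy sketch. It is not the paper's implementation: the abstract does not specify how demonstrations are segmented or how the switch module decides to advance, so the gripper-state boundary rule, the `Primitive` type, the labels `reach` and `carry`, and the functions `disassemble` and `assemble` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Primitive:
    """A reusable motion segment cut from one demonstration (hypothetical type)."""
    label: str  # illustrative label, not a name from the paper
    start: int  # index of the first step in the segment (inclusive)
    end: int    # index one past the last step (exclusive)


def disassemble(gripper_closed: Sequence[bool]) -> List[Primitive]:
    """Split a demonstration wherever the gripper state changes.

    Assumed stand-in heuristic; the paper's automated pipeline may differ.
    """
    segments: List[Primitive] = []
    start = 0
    for t in range(1, len(gripper_closed) + 1):
        at_end = t == len(gripper_closed)
        if at_end or gripper_closed[t] != gripper_closed[start]:
            label = "carry" if gripper_closed[start] else "reach"
            segments.append(Primitive(label, start, t))
            start = t
    return segments


def assemble(plan: Sequence[str],
             step: Callable[[str], None],
             is_done: Callable[[str], bool],
             max_steps: int = 100) -> List[str]:
    """Run primitives in plan order; a switch predicate decides when to advance.

    `plan` stands in for the planner output and `is_done` for the switch
    module; both are hypothetical interfaces.
    """
    completed: List[str] = []
    for name in plan:
        for _ in range(max_steps):
            if is_done(name):  # closed-loop check before every control step
                completed.append(name)
                break
            step(name)
        else:
            break  # primitive never finished within the budget; stop early
    return completed
```

In this sketch, `disassemble` plays the role of fine-tuning-phase disassembly and `assemble` the role of inference-phase assembly; in a real system `step` would query the policy for one action and `is_done` would inspect the current observation.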