🤖 AI Summary
This work aims to achieve general-purpose robotic policies capable of zero-shot deployment across diverse morphologies without morphology-specific fine-tuning. To this end, the authors propose Language-Action Pretraining (LAP), a method that represents low-level actions as natural-language tokens, aligning their input-output distribution with that of vision-language models. By unifying action prediction and visual question answering within a co-training framework, LAP enables substantial zero-shot transfer to unseen robot morphologies—without requiring custom tokenizers, costly annotations, or morphology-specific architectures. The resulting LAP-3B model achieves an average zero-shot success rate exceeding 50% across multiple novel robots and manipulation tasks, roughly a two-fold improvement over current state-of-the-art vision-language-action (VLA) models.
📝 Abstract
A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training, existing Vision-Language-Action models (VLAs) remain tightly coupled to their training embodiments and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple recipe that represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP requires no learned tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across multiple novel robots and manipulation tasks, LAP-3B attains over 50% average zero-shot success, delivering roughly a 2x improvement over the strongest prior VLAs. We further show that LAP enables efficient adaptation and favorable scaling, while unifying action prediction and VQA in a shared language-action format that yields additional gains through co-training.
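The core idea of LAP, as described above, is to express continuous low-level actions as plain natural-language text so that a pre-trained vision-language model can be supervised on them with its ordinary tokenizer, with no learned action tokenizer in between. A minimal sketch of that idea is shown below; the specific field names, rounding, and delimiters are illustrative assumptions, not the paper's actual format.

```python
# Illustrative sketch of a language-action encoding (assumed format,
# not the paper's specification): a 7-DoF end-effector action is
# rendered as ordinary text, so a VLM's standard tokenizer handles it.

def action_to_language(delta_xyz, delta_rpy, gripper):
    """Render a 7-DoF end-effector action as a natural-language string."""
    x, y, z = (round(v, 3) for v in delta_xyz)
    roll, pitch, yaw = (round(v, 3) for v in delta_rpy)
    grip = "close" if gripper > 0.5 else "open"
    return (f"move x {x} y {y} z {z}; "
            f"rotate roll {roll} pitch {pitch} yaw {yaw}; "
            f"gripper {grip}")

def language_to_action(text):
    """Parse the generated text back into numeric commands at deployment."""
    move, rotate, grip = text.split("; ")
    delta_xyz = [float(t) for t in move.split()[2::2]]
    delta_rpy = [float(t) for t in rotate.split()[2::2]]
    return delta_xyz, delta_rpy, grip.split()[1]
```

Because both directions operate on plain text, action supervision and VQA share one input-output format, which is what makes the co-training described in the abstract possible.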