AI Summary
This work addresses the limited generalization of large behavior models in robotic manipulation due to the scarcity of robot data. By leveraging 4,000 hours of human and robot manipulation data alongside 50 million vision-language samples, the study systematically evaluates five heterogeneous co-training modalities (including cross-embodiment robot data, human videos, densely annotated language, and discrete action tokens) under single- and multi-stage training strategies to construct a unified vision-language-action policy. It presents the first large-scale empirical analysis of how different modalities influence generalization, task transfer, and language following, demonstrating that effective modality combinations yield cumulative gains and recover the semantic understanding inherent in vision-language models. Extensive validation across 58,000 simulated trials and 2,835 real-world experiments shows that integrating vision-language pretraining with cross-embodiment data significantly enhances generalization and enables rapid adaptation to unseen long-horizon dexterous tasks.
Abstract
Large behavior models have shown strong dexterous manipulation capabilities by extending imitation learning to large-scale training on multi-task robot data, yet their generalization remains limited by insufficient robot data coverage. To expand this coverage without costly additional data collection, recent work relies on co-training: jointly learning from target robot data and heterogeneous data modalities. However, how different co-training data modalities and strategies affect policy performance remains poorly understood. We present a large-scale empirical study examining five co-training data modalities (standard vision-language data, dense language annotations for robot trajectories, cross-embodiment robot data, human videos, and discrete robot action tokens) across single- and multi-phase training strategies. Our study leverages 4,000 hours of robot and human manipulation data and 50M vision-language samples to train vision-language-action policies. We evaluate 89 policies over 58,000 simulation rollouts and 2,835 real-world rollouts. Our results show that co-training with forms of vision-language data and cross-embodiment robot data substantially improves generalization to distribution shifts, unseen tasks, and language following, while discrete action token variants yield no significant benefits. Combining effective modalities produces cumulative gains and enables rapid adaptation to unseen long-horizon dexterous tasks via fine-tuning. Training exclusively on robot data degrades the visiolinguistic understanding of the vision-language model backbone, while co-training with effective modalities restores these capabilities. Explicitly conditioning action generation on chain-of-thought traces learned from co-training data does not improve performance in our simulation benchmark. Together, these results provide practical guidance for building scalable generalist robot policies.
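To make the co-training setup concrete, the sketch below shows one common way to mix heterogeneous data sources into training batches under tunable modality ratios. This is purely illustrative and not the paper's implementation: all dataset names, sizes, and mixture weights are hypothetical placeholders standing in for the target-robot, cross-embodiment, human-video, and vision-language corpora described above.

```python
import random
from typing import Dict, Iterator, List

# Hypothetical placeholder datasets, one per co-training modality.
def make_dummy_dataset(name: str, size: int) -> List[dict]:
    return [{"modality": name, "index": i} for i in range(size)]

def cotraining_batches(
    datasets: Dict[str, List[dict]],
    weights: Dict[str, float],
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[dict]]:
    """Yield batches whose composition follows the modality mixture weights.

    Each sample is drawn by first picking a modality according to `weights`,
    then sampling uniformly within that modality, so scarce target-robot data
    can be up-weighted relative to large web-scale vision-language corpora.
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[n] for n in names]
    while True:
        batch = [
            rng.choice(datasets[rng.choices(names, weights=probs, k=1)[0]])
            for _ in range(batch_size)
        ]
        yield batch

if __name__ == "__main__":
    data = {
        "target_robot": make_dummy_dataset("target_robot", 1_000),
        "cross_embodiment": make_dummy_dataset("cross_embodiment", 10_000),
        "human_video": make_dummy_dataset("human_video", 10_000),
        "vision_language": make_dummy_dataset("vision_language", 50_000),
    }
    # Illustrative mixture ratios only; the paper's actual ratios are not stated here.
    weights = {"target_robot": 0.4, "cross_embodiment": 0.3,
               "human_video": 0.15, "vision_language": 0.15}
    batch = next(cotraining_batches(data, weights, batch_size=8))
    print([s["modality"] for s in batch])
```

In a setup like this, the mixture weights (and whether they change between training phases) are exactly the kind of design choice the study's single- versus multi-phase comparisons probe.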