AI Summary
This work addresses the limited generalization of large behavior models in robotic manipulation due to the scarcity of robot data. By leveraging 4,000 hours of human and robot manipulation data alongside 50 million vision-language samples, the study systematically evaluates five heterogeneous co-training modalities (including cross-embodiment robot data, human videos, densely annotated language, and discrete action tokens) under single- and multi-stage training strategies to construct a unified vision-language-action policy. It presents the first large-scale empirical analysis of how different modalities influence generalization, task transfer, and language following, demonstrating that effective modality combinations yield cumulative gains and recover the semantic understanding inherent in vision-language models. Extensive validation across 58,000 simulated trials and 2,835 real-world experiments shows that integrating vision-language pretraining with cross-embodiment data significantly enhances generalization and enables rapid adaptation to unseen long-horizon dexterous tasks.
Abstract
Large behavior models have shown strong dexterous manipulation capabilities by extending imitation learning to large-scale training on multi-task robot data, yet their generalization remains limited by insufficient robot data coverage. To expand this coverage without costly additional data collection, recent work relies on co-training: jointly learning from target robot data and heterogeneous data modalities. However, how different co-training data modalities and strategies affect policy performance remains poorly understood. We present a large-scale empirical study examining five co-training data modalities (standard vision-language data, dense language annotations for robot trajectories, cross-embodiment robot data, human videos, and discrete robot action tokens) across single- and multi-phase training strategies. Our study leverages 4,000 hours of robot and human manipulation data and 50M vision-language samples to train vision-language-action policies. We evaluate 89 policies over 58,000 simulation rollouts and 2,835 real-world rollouts. Our results show that co-training with forms of vision-language data and cross-embodiment robot data substantially improves generalization to distribution shifts, unseen tasks, and language following, while discrete action token variants yield no significant benefits. Combining effective modalities produces cumulative gains and enables rapid adaptation to unseen long-horizon dexterous tasks via fine-tuning. Training exclusively on robot data degrades the visiolinguistic understanding of the vision-language model backbone, while co-training with effective modalities restores these capabilities. Explicitly conditioning action generation on chain-of-thought traces learned from co-training data does not improve performance in our simulation benchmark. Together, these results provide practical guidance for building scalable generalist robot policies.
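To make the co-training setup concrete, the sketch below shows one common way to mix heterogeneous data sources into training batches under tunable modality ratios. This is purely illustrative and not the paper's implementation: all dataset names, sizes, and mixture weights are hypothetical placeholders standing in for the target-robot, cross-embodiment, human-video, and vision-language corpora described above.

```python
import random
from typing import Dict, Iterator, List

# Hypothetical placeholder datasets, one per co-training modality.
def make_dummy_dataset(name: str, size: int) -> List[dict]:
    return [{"modality": name, "index": i} for i in range(size)]

def cotraining_batches(
    datasets: Dict[str, List[dict]],
    weights: Dict[str, float],
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[dict]]:
    """Yield batches whose composition follows the modality mixture weights.

    Each sample is drawn by first picking a modality according to `weights`,
    then sampling uniformly within that modality, so scarce target-robot data
    can be up-weighted relative to large web-scale vision-language corpora.
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[n] for n in names]
    while True:
        batch = [
            rng.choice(datasets[rng.choices(names, weights=probs, k=1)[0]])
            for _ in range(batch_size)
        ]
        yield batch

if __name__ == "__main__":
    data = {
        "target_robot": make_dummy_dataset("target_robot", 1_000),
        "cross_embodiment": make_dummy_dataset("cross_embodiment", 10_000),
        "human_video": make_dummy_dataset("human_video", 10_000),
        "vision_language": make_dummy_dataset("vision_language", 50_000),
    }
    # Illustrative mixture ratios only; the paper's actual ratios are not stated here.
    weights = {"target_robot": 0.4, "cross_embodiment": 0.3,
               "human_video": 0.15, "vision_language": 0.15}
    batch = next(cotraining_batches(data, weights, batch_size=8))
    print([s["modality"] for s in batch])
```

In a setup like this, the mixture weights (and whether they change between training phases) are exactly the kind of design choice the study's single- versus multi-phase comparisons probe.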