A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation

📅 2026-02-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limited generalization of large behavior models for robotic manipulation, which stems from the scarcity of robot data. Leveraging 4,000 hours of human and robot manipulation data alongside 50 million vision-language samples, the study systematically evaluates five heterogeneous co-training modalities (standard vision-language data, dense language annotations for robot trajectories, cross-embodiment robot data, human videos, and discrete robot action tokens) under single- and multi-stage training strategies to build a unified vision-language-action policy. It presents the first large-scale empirical analysis of how these modalities influence generalization, task transfer, and language following, showing that effective modality combinations yield cumulative gains and recover the semantic understanding inherent in vision-language models. Extensive validation across 58,000 simulated rollouts and 2,835 real-world rollouts shows that integrating vision-language pretraining with cross-embodiment data substantially enhances generalization and enables rapid adaptation to unseen long-horizon dexterous tasks.
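
The co-training recipe described above amounts to drawing each training example from one of several modality-specific datasets according to a mixture ratio. The sketch below illustrates that idea; the modality names, weights, and dataset interface are hypothetical placeholders rather than the paper's actual configuration.

```python
import random

# Hypothetical mixture weights over the co-training modalities studied in the
# paper; the actual ratios used are not specified here.
MODALITY_WEIGHTS = {
    "target_robot_data": 0.40,            # demonstrations on the target embodiment
    "cross_embodiment_robot_data": 0.25,  # data from other robot platforms
    "vision_language_data": 0.15,         # generic captioning / VQA samples
    "dense_language_annotations": 0.10,   # richly annotated robot trajectories
    "human_videos": 0.10,                 # human manipulation videos
}


def sample_cotraining_batch(datasets, weights, batch_size, rng=random):
    """Draw a mixed batch: pick a modality per example, then an item from it."""
    names = list(weights)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(batch_size):
        modality = rng.choices(names, weights=probs, k=1)[0]
        batch.append((modality, rng.choice(datasets[modality])))
    return batch


if __name__ == "__main__":
    # Toy datasets standing in for real multimodal corpora.
    toy = {name: [f"{name}_{i}" for i in range(100)] for name in MODALITY_WEIGHTS}
    for modality, example in sample_cotraining_batch(toy, MODALITY_WEIGHTS, 8):
        print(modality, example)
```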

šŸ“ Abstract
Large behavior models have shown strong dexterous manipulation capabilities by extending imitation learning to large-scale training on multi-task robot data, yet their generalization remains limited by insufficient robot data coverage. To expand this coverage without costly additional data collection, recent work relies on co-training: jointly learning from target robot data and heterogeneous data modalities. However, how different co-training data modalities and strategies affect policy performance remains poorly understood. We present a large-scale empirical study examining five co-training data modalities (standard vision-language data, dense language annotations for robot trajectories, cross-embodiment robot data, human videos, and discrete robot action tokens) across single- and multi-phase training strategies. Our study leverages 4,000 hours of robot and human manipulation data and 50M vision-language samples to train vision-language-action policies. We evaluate 89 policies over 58,000 simulation rollouts and 2,835 real-world rollouts. Our results show that co-training with forms of vision-language data and cross-embodiment robot data substantially improves generalization to distribution shifts, unseen tasks, and language following, while discrete action token variants yield no significant benefits. Combining effective modalities produces cumulative gains and enables rapid adaptation to unseen long-horizon dexterous tasks via fine-tuning. Training exclusively on robot data degrades the visiolinguistic understanding of the vision-language model backbone, while co-training with effective modalities restores these capabilities. Explicitly conditioning action generation on chain-of-thought traces learned from co-training data does not improve performance in our simulation benchmark. Together, these results provide practical guidance for building scalable generalist robot policies.
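
The abstract contrasts single- and multi-phase training strategies: either co-train on the full modality mixture in one stage, or co-train broadly first and then fine-tune on target robot data (for example, an unseen long-horizon task). The following is a minimal sketch of that distinction under assumed interfaces; `PolicyStub`, the batch streams, and the step counts are all illustrative placeholders, not the paper's implementation.

```python
from itertools import islice


class PolicyStub:
    """Stand-in for a vision-language-action policy; only counts updates."""

    def __init__(self):
        self.updates = 0

    def update(self, batch):
        self.updates += 1  # a real policy would take a gradient step here


def train(policy, batch_stream, steps):
    # Generic loop: one update per batch for `steps` steps.
    for batch in islice(batch_stream, steps):
        policy.update(batch)
    return policy


def single_phase(policy, mixed_batches, steps):
    # Single-stage strategy: co-train on the heterogeneous mixture throughout.
    return train(policy, mixed_batches, steps)


def multi_phase(policy, mixed_batches, target_batches, pretrain_steps, finetune_steps):
    # Multi-stage strategy: broad co-training, then fine-tuning on target robot data.
    policy = train(policy, mixed_batches, pretrain_steps)
    return train(policy, target_batches, finetune_steps)


if __name__ == "__main__":
    mixed = iter(lambda: "mixed_batch", None)    # endless placeholder stream
    target = iter(lambda: "target_batch", None)
    policy = multi_phase(PolicyStub(), mixed, target, pretrain_steps=1000, finetune_steps=200)
    print(policy.updates)  # 1200
```
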
Problem

Research questions and friction points this paper is trying to address.

co-training
data modality
robot manipulation
generalization
large behavior models
Innovation

Methods, ideas, or system contributions that make the work stand out.

co-training
large behavior models
robot manipulation
vision-language-action policies
cross-embodiment data
Authors

Fanqi Lin · Tsinghua University · Embodied AI, Robotics
Kushal Arora · Toyota Research Institute, Cambridge MA and Los Altos CA, USA
Jean Mercat · Research scientist at Toyota Research Institute · Neural networks
Haruki Nishimura · Toyota Research Institute · robotics, machine learning, planning under uncertainty, statistics, probabilistic inference
Paarth Shah · Toyota Research Institute, Cambridge MA and Los Altos CA, USA
Chen Xu · Toyota Research Institute (TRI) · Imitation Learning, Generative Models, Uncertainty Quantification, Operations Research
Mengchao Zhang · Toyota Research Institute, Cambridge MA and Los Altos CA, USA
Mark Zolotas · Research Scientist, Toyota Research Institute · Shared Control, Human-Robot Interaction, Extended Reality, Representation Learning
Maya Angeles · Toyota Research Institute, Cambridge MA and Los Altos CA, USA
Owen Pfannenstiehl · Toyota Research Institute, Cambridge MA and Los Altos CA, USA
Andrew Beaulieu · Toyota Research Institute, Cambridge MA and Los Altos CA, USA
Jose Barreiros · Scientist, Toyota Research Institute · Physical AI, Whole-body Manipulation, Haptic Intelligence