Emergence of Human to Robot Transfer in Vision-Language-Action Models

📅 2025-12-26
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Human-to-robot action transfer typically requires manually engineered cross-modal mappings, which limits scalability and generalizability. Method: We propose a vision-language-action (VLA) pretraining paradigm that eliminates the need for ground-truth robot action annotations on human video data. Our approach co-trains, via joint distillation, on large-scale, diverse human videos paired with robot trajectories spanning multiple scenes, tasks, and morphologies. It integrates cross-modal representation alignment with an analysis of scale-driven emergence, and introduces an embodiment-agnostic representation mechanism that decouples learning from robot-specific kinematics. Contribution/Results: We empirically discover and validate that human-to-robot transfer capability emerges spontaneously as pretraining diversity increases, a previously unobserved phenomenon. On zero-shot generalization to settings seen only in human videos, our method nearly doubles the performance of prior approaches, demonstrating that the model implicitly learns a robust mapping from human actions to executable robot policies without explicit supervision.
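To make the embodiment-agnostic representation idea concrete, here is a minimal sketch of a shared encoder trained with a cross-modal alignment loss. The contrastive (InfoNCE-style) objective, the module names, and all dimensions are illustrative assumptions, not details confirmed by the paper.

```python
# Hedged sketch: a shared encoder maps human-video and robot-trajectory
# features into one embedding space, pulled together by a contrastive
# alignment loss. Everything here (names, dims, loss) is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbodimentAgnosticEncoder(nn.Module):
    """Projects human-video or robot-trajectory features into one shared space."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.GELU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def alignment_loss(human_emb, robot_emb, temperature: float = 0.07):
    """Symmetric InfoNCE over human/robot clips of the same task, paired by index."""
    logits = human_emb @ robot_emb.t() / temperature
    targets = torch.arange(len(human_emb), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-ins for pooled clip features.
human_feats = torch.randn(32, 1024)   # e.g., pooled human-video features
robot_feats = torch.randn(32, 1024)   # e.g., pooled robot-trajectory features
encoder = EmbodimentAgnosticEncoder(in_dim=1024)
loss = alignment_loss(encoder(human_feats), encoder(robot_feats))
loss.backward()
```

L2-normalizing the embeddings before the dot product makes the logits cosine similarities, which keeps the temperature interpretable regardless of which embodiment the features came from.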

📝 Abstract
Vision-language-action (VLA) models can enable broad open-world generalization, but require large and diverse datasets. It is appealing to consider whether some of this data can come from human videos, which cover diverse real-world situations and are easy to obtain. However, it is difficult to train VLAs with human videos alone, and establishing a mapping between humans and robots requires manual engineering and remains a major research challenge. Drawing inspiration from advances in large language models, where the ability to learn from diverse supervision emerges with scale, we ask whether a similar phenomenon holds for VLAs that incorporate human video data. We introduce a simple co-training recipe and find that human-to-robot transfer emerges once the VLA is pre-trained on sufficient scenes, tasks, and embodiments. Our analysis suggests that this emergent capability arises because diverse pretraining produces embodiment-agnostic representations for human and robot data. We validate these findings through a series of experiments probing human-to-robot skill transfer and find that, with sufficiently diverse robot pre-training, our method can nearly double the performance on generalization settings seen only in human data.
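As I read the abstract, the "simple co-training recipe" amounts to interleaving robot batches (which carry action labels) with human-video batches (which do not), applying the action loss only where labels exist. The sketch below illustrates that loop; the TinyVLA stand-in model, its loss heads, and the 50/50 mixing ratio are all assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch of a co-training loop mixing human-video and robot data.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVLA(nn.Module):
    """Minimal stand-in: a shared backbone with a language head and an action head."""
    def __init__(self, obs_dim=64, act_dim=7, vocab=100):
        super().__init__()
        self.backbone = nn.Linear(obs_dim, 128)
        self.lang_head = nn.Linear(128, vocab)
        self.act_head = nn.Linear(128, act_dim)

    def vl_loss(self, obs, tokens):
        # vision-language objective: predict instruction tokens from observations
        h = torch.relu(self.backbone(obs))
        return F.cross_entropy(self.lang_head(h), tokens)

    def action_loss(self, obs, actions):
        # action objective: regress robot actions (only robot data has these)
        h = torch.relu(self.backbone(obs))
        return F.mse_loss(self.act_head(h), actions)

def batches(n=8, obs_dim=64, act_dim=7, vocab=100, with_actions=True):
    # infinite stream of toy batches; human video yields no action labels
    while True:
        obs = torch.randn(n, obs_dim)
        tokens = torch.randint(0, vocab, (n,))
        actions = torch.randn(n, act_dim) if with_actions else None
        yield obs, tokens, actions

model = TinyVLA()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
robot_data = batches(with_actions=True)    # robot trajectories with actions
human_data = batches(with_actions=False)   # human videos, actions unavailable

for step in range(10):
    source = human_data if random.random() < 0.5 else robot_data
    obs, tokens, actions = next(source)
    loss = model.vl_loss(obs, tokens)
    if actions is not None:                # action loss only on robot batches
        loss = loss + model.action_loss(obs, actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key property of such a recipe is that human video contributes gradient signal through the shared backbone even though it never supplies action labels, which is what lets transfer emerge at scale rather than through hand-built mappings.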
Problem

Research questions and friction points this paper is trying to address.

How can robots learn skills from human videos, which lack robot action labels?
How can human actions be mapped onto robot embodiments without manual engineering?
How much pre-training diversity is needed for generalization to emerge?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-training recipe for human-robot transfer
Diverse pre-training produces embodiment-agnostic representations (probed in the sketch after this list)
Human video data enhances robot generalization performance
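One simple way to probe the embodiment-agnostic-representations claim is to check whether matched human/robot clips of the same task sit closer in embedding space than mismatched pairs. The sketch below shows such a probe; the random tensors are stand-ins for embeddings that would, in practice, come from the pre-trained VLA encoder.

```python
# Hedged probe sketch: matched vs. mismatched cosine similarity gap.
# A positive gap suggests human and robot clips of the same task share
# a representation. The embeddings here are random placeholders.
import torch
import torch.nn.functional as F

def matched_vs_mismatched_gap(human_emb: torch.Tensor,
                              robot_emb: torch.Tensor) -> float:
    """Mean cosine similarity of matched pairs minus mismatched pairs."""
    sim = F.normalize(human_emb, dim=-1) @ F.normalize(robot_emb, dim=-1).t()
    matched = sim.diag().mean()                    # same-task pairs
    n = sim.size(0)
    mismatched = (sim.sum() - sim.diag().sum()) / (n * n - n)
    return (matched - mismatched).item()

# Stand-in embeddings for 16 tasks, one human and one robot clip each.
gap = matched_vs_mismatched_gap(torch.randn(16, 256), torch.randn(16, 256))
print(f"matched-minus-mismatched similarity gap: {gap:.3f}")
```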