Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language-action (VLA) models generalize poorly on complex dexterous manipulation tasks, primarily because they rely on simulation data with a significant sim-to-real gap or on small-scale, low-diversity teleoperation demonstrations. To address this, we propose a VLA pretraining framework grounded in large-scale human manipulation videos. Our approach introduces two key innovations: (1) a novel physical instruction tuning paradigm, and (2) a part-level motion tokenization method enabling millimeter-accurate hand trajectory reconstruction. We integrate heterogeneous multimodal data—including motion capture, VR recordings, and RGB videos—within a unified modeling architecture aligned to 3D physical space. A post-pretraining adaptation stage further enhances downstream task transferability. Experiments demonstrate significant improvements over baselines in instruction following and hand motion generation. Crucially, the model achieves strong generalization and scalability, validated on real robotic platforms.

📝 Abstract
We introduce Being-H0, a dexterous Vision-Language-Action (VLA) model trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or on teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method that achieves millimeter-level reconstruction accuracy when modeling precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources -- including motion capture, VR, and RGB-only videos -- into a large-scale dataset with millions of motion-based instructional instances. We empirically show that Being-H0 excels at hand motion generation and instruction following, and that it scales well with model and data sizes. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.
Problem

Research questions and friction points this paper is trying to address.

Addresses poor generalization in Vision-Language-Action models for novel tasks
Overcomes data bottleneck using human videos for dexterous manipulation
Improves hand motion generation and instruction following accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging human hands for dexterous manipulation
Physical instruction tuning for VLA pretraining
Part-level motion tokenization for precise trajectories
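This summary does not specify how the part-level motion tokenizer works internally. As a rough, hypothetical sketch, the idea of mapping each hand part's continuous motion to discrete tokens can be illustrated with nearest-neighbor vector quantization; the part layout, dimensions, and random codebooks below are illustrative placeholders, not the paper's learned tokenizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical part layout: wrist pose plus per-finger joint features.
# Dimensions and codebook size are illustrative, not from the paper.
PARTS = {"wrist": 6, "thumb": 4, "index": 4, "middle": 4, "ring": 4, "pinky": 4}
CODEBOOK_SIZE = 256

# One codebook per part; in practice these would be learned (e.g. VQ-VAE
# style), here they are random placeholders.
codebooks = {p: rng.standard_normal((CODEBOOK_SIZE, d)) for p, d in PARTS.items()}

def tokenize(trajectory):
    """Map each frame's per-part features to the nearest codebook entry's index."""
    tokens = {}
    for part in PARTS:
        feats = trajectory[part]  # shape (T, dim)
        # Squared distance from every frame to every code: shape (T, K).
        dists = ((feats[:, None, :] - codebooks[part][None, :, :]) ** 2).sum(-1)
        tokens[part] = dists.argmin(axis=1)  # (T,) integer token ids
    return tokens

def detokenize(tokens):
    """Reconstruct per-part features from token ids (the lossy inverse)."""
    return {part: codebooks[part][ids] for part, ids in tokens.items()}

# Demo: tokenize a random 50-frame hand trajectory and reconstruct it.
traj = {p: rng.standard_normal((50, d)) for p, d in PARTS.items()}
toks = tokenize(traj)
recon = detokenize(toks)
```

Discretizing motion this way lets hand actions share the same token interface as language, which is what allows a VLA model to emit actions autoregressively; the paper's contribution is doing this per hand part with millimeter-level reconstruction accuracy.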