X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-Language-Action models and world models in embodied AI are constrained by the scarcity of high-quality humanoid robot training data. Method: This paper introduces X-Humanoid, a generative video editing framework that enables large-scale, high-fidelity cross-modal translation from human videos to humanoid robot motion videos. The authors fine-tune a video-to-video model based on Wan 2.2 and integrate it with Unreal Engine to build a scalable synthetic data pipeline, producing over 17 hours of paired human–robot video. Applying the trained model to 60 hours of real human videos from Ego-Exo4D, they synthesize over 3.6 million robot-motion frames. Results: In user studies, 69% of participants rated X-Humanoid highest for motion consistency and 62.1% affirmed its embodied action correctness, both significantly outperforming existing methods. This work establishes the first large-scale paradigm adapted to third-person views for humanoid robot motion synthesis, advancing embodied intelligence research.
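The translation step can be pictured as conditional video diffusion fine-tuned on paired clips. Below is a minimal sketch of one such training step, assuming a rectified-flow-style objective; the `model` and `vae` interfaces are hypothetical stand-ins, not the authors' Wan 2.2 code.

```python
import torch
import torch.nn.functional as F

def v2v_training_step(model, vae, human_clip, humanoid_clip, optimizer):
    """One fine-tuning step: the human clip conditions the model; the paired
    humanoid clip (e.g., rendered in Unreal Engine) is the target."""
    with torch.no_grad():
        cond = vae.encode(human_clip)        # latent of the source human video
        target = vae.encode(humanoid_clip)   # latent of the target humanoid video
    noise = torch.randn_like(target)
    t = torch.rand(target.shape[0], device=target.device)   # per-sample time
    tb = t.view(-1, *([1] * (target.dim() - 1)))            # broadcastable time
    noisy = (1 - tb) * target + tb * noise   # linear interpolation noising
    pred = model(noisy, t, cond=cond)        # predict the flow velocity
    loss = F.mse_loss(pred, noise - target)  # rectified-flow velocity target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```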

📝 Abstract
The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, existing solutions mainly "overlay" robot arms onto egocentric videos; they cannot handle the complex full-body motions and scene occlusions of third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and fine-tunes it for the human-to-humanoid translation task. This fine-tuning requires paired human-humanoid videos, so we design a scalable data creation pipeline that turns community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.
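As a back-of-the-envelope check on the reported scale, 60 hours of video at the implied ~16.7 fps comes to roughly 3.6 million frames. The hedged sketch below shows what the batch "robotizing" loop over a corpus like Ego-Exo4D could look like, with a hypothetical `translate` callable standing in for the fine-tuned model; the released pipeline may differ.

```python
import os
from typing import Callable, Iterable

def robotize_corpus(clip_paths: Iterable[str],
                    translate: Callable[[str], list],  # hypothetical model wrapper
                    out_dir: str) -> int:
    """Translate every human clip into humanoid frames; return total frame count."""
    os.makedirs(out_dir, exist_ok=True)
    total = 0
    for path in clip_paths:
        frames = translate(path)  # humanoid frames for this source clip
        total += len(frames)      # (re-encoding frames to video omitted here)
    return total

# Scale check against the paper's numbers: 60 h at ~16.7 fps.
print(f"{60 * 3600 * 16.7:,.0f} frames")  # ≈ 3,607,200
```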
Problem

Research questions and friction points this paper is trying to address.

Generating humanoid videos from human videos at scale
Scarcity of large-scale, diverse training data for humanoid robots
Handling complex full-body motions and scene occlusions in third-person videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts the Wan 2.2 model into a video-to-video structure for human-to-humanoid translation
Creates paired synthetic training data with an Unreal Engine rendering pipeline (sketched below)
Generates a large-scale "robotized" humanoid video dataset from real human videos
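A hedged illustration of how such engine-rendered pairs might be indexed for fine-tuning; the field names are illustrative, not the released format. The key property is that both clips share the same motion asset, camera, and scene, differing only in the rendered character.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PairedClip:
    motion_id: str       # shared animation asset driving both renders
    human_video: str     # Unreal Engine render with a human character
    humanoid_video: str  # same scene and camera, humanoid robot character
    camera: str          # third-person viewpoint tag

def write_manifest(pairs: list[PairedClip], path: str) -> None:
    """Serialize the paired-clip index that a fine-tuning dataloader would read."""
    with open(path, "w") as f:
        json.dump([asdict(p) for p in pairs], f, indent=2)

pair = PairedClip("walk_017", "renders/human/walk_017.mp4",
                  "renders/humanoid/walk_017.mp4", "cam_front_left")
write_manifest([pair], "paired_manifest.json")
```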