🤖 AI Summary
Vision-language-action models and world models in embodied AI are constrained by the scarcity of high-quality humanoid robot training data. Method: This paper introduces X-Humanoid, a generative video-editing framework enabling large-scale, high-fidelity cross-embodiment translation from human videos to humanoid robot motion videos. We fine-tune a video-to-video model based on Wan 2.2, using a scalable Unreal Engine pipeline to generate the over 17 hours of paired human–robot video required for supervision. Applying the trained model to 60 hours of real human videos from Ego-Exo4D, we additionally synthesize over 3.6 million robot-motion frames. Results: In user studies, 69% of participants rated X-Humanoid highest for motion consistency and 62.1% affirmed its embodied action correctness, both significantly outperforming existing methods. This work establishes the first large-scale paradigm for adapting third-person human videos to humanoid robot motion synthesis, advancing embodied intelligence research.
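The paper does not ship code with this summary, so the following is only a minimal, self-contained PyTorch sketch of the core idea: a video-to-video denoiser that, conditioned on a human clip, iteratively refines noise into a humanoid clip. The class `HumanToHumanoidV2V`, the `robotize` helper, and the crude fixed-point refinement loop are all illustrative assumptions; the actual system fine-tunes the Wan 2.2 diffusion backbone.

```python
# Hypothetical sketch of X-Humanoid-style video-to-video inference.
# All names and the refinement scheme are placeholders, not the released API.
import torch
import torch.nn as nn


class HumanToHumanoidV2V(nn.Module):
    """Toy stand-in for a fine-tuned video-to-video diffusion model."""

    def __init__(self, channels: int = 3):
        super().__init__()
        # A real model would be a spatio-temporal diffusion transformer;
        # a single 3D conv keeps this sketch self-contained and runnable.
        self.net = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, noisy_robot: torch.Tensor, human_cond: torch.Tensor) -> torch.Tensor:
        # Condition the denoiser on the source human video via channel concat.
        return self.net(torch.cat([noisy_robot, human_cond], dim=1))


@torch.no_grad()
def robotize(model: nn.Module, human_video: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Refine noise into a humanoid video, conditioned on the human clip.

    human_video: (batch, channels, frames, height, width), values in [-1, 1].
    """
    x = torch.randn_like(human_video)  # start from pure noise
    for _ in range(steps):
        pred = model(x, human_video)   # predict the clean robot video
        x = 0.5 * x + 0.5 * pred       # crude fixed-point refinement, for illustration
    return x


model = HumanToHumanoidV2V()
human_clip = torch.randn(1, 3, 16, 64, 64)  # 16 frames of 64x64 RGB
robot_clip = robotize(model, human_clip)
print(robot_clip.shape)                      # torch.Size([1, 3, 16, 64, 64])
```

The key design point this sketch preserves is that the human video enters as a dense per-frame condition rather than a text prompt, which is what lets the output track full-body motion frame by frame.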
📝 Abstract
The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has proven effective for policy training. Existing solutions, however, mainly "overlay" robot arms onto egocentric videos; they cannot handle the complex full-body motions and scene occlusions of third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video-editing approach that adapts the powerful Wan 2.2 model into a video-to-video architecture and fine-tunes it for the human-to-humanoid translation task. This fine-tuning requires paired human-humanoid videos, so we design a scalable data-creation pipeline that turns community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.
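To make the paired-data requirement concrete, here is a hedged sketch of how the Unreal Engine renders might be consumed for fine-tuning. The directory layout (`human/` and `robot/` clips sharing a motion id) and the simple regression loss are assumptions for illustration; the released dataset format and the actual Wan 2.2 diffusion objective may differ.

```python
# Hypothetical loader and training step for paired human/humanoid clips.
# Layout and loss are illustrative assumptions, not the paper's released code.
from pathlib import Path

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset


class PairedHumanRobotClips(Dataset):
    """Yields (human_clip, robot_clip) tensor pairs rendered from the same motion."""

    def __init__(self, root: str):
        # Assumed layout: <root>/human/<motion_id>.pt and <root>/robot/<motion_id>.pt,
        # each a (C, T, H, W) tensor of the same motion on a human / robot body.
        self.root = Path(root)
        self.ids = sorted(p.stem for p in (self.root / "human").glob("*.pt"))

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        motion_id = self.ids[idx]
        human = torch.load(self.root / "human" / f"{motion_id}.pt")
        robot = torch.load(self.root / "robot" / f"{motion_id}.pt")
        return human, robot


def training_step(model, optimizer, human, robot):
    """One simplified step: denoise a corrupted robot clip conditioned on the
    paired human clip (a stand-in for the real diffusion training loss)."""
    noisy = robot + torch.randn_like(robot)  # corrupt the target clip
    pred = model(noisy, human)               # denoiser conditioned on the human video
    loss = F.mse_loss(pred, robot)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because both clips in each pair are rendered from the same underlying motion, the model only has to learn the embodiment swap, not the motion itself, which is what makes synthetic Unreal Engine pairs a viable substitute for real human-robot footage.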