OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limited scalability of existing cross-embodiment video generation methods, which often entangle motion with appearance and morphology and, in many cases, require paired data for every target embodiment. The authors propose OmniHumanoid, a framework that factorizes the problem: a shared motion transfer model is learned from motion-aligned paired videos spanning multiple embodiments, and a new embodiment is added through lightweight embodiment-specific adapters trained on unpaired videos only. A branch-isolated attention design separates motion conditioning from embodiment-specific modulation to reduce interference between the two. The study also constructs a synthetic cross-embodiment dataset of motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on synthetic and real-world benchmarks show strong motion fidelity and embodiment consistency, with adaptation to unseen humanoid embodiments that does not require retraining the shared motion model.
📝 Abstract
Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
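The abstract does not spell out how branch-isolated attention is built, so the following is only an illustrative sketch of one possible reading, not the authors' implementation. It assumes a hypothetical token layout of [video | motion | embodiment], in which video tokens may read from both conditioning branches while the motion and embodiment branches never attend to each other. The function names `branch_mask` and `masked_attention`, the layout, and the single-head formulation are all assumptions made for illustration.

```python
import math


def branch_mask(n_video, n_motion, n_embod):
    """Return mask[i][j] = True if query token i may attend to key token j.

    Assumed (hypothetical) token layout: [video | motion | embodiment].
    """
    groups = ["video"] * n_video + ["motion"] * n_motion + ["embod"] * n_embod
    allowed = {
        "video": {"video", "motion", "embod"},  # generated frames read both branches
        "motion": {"motion"},                   # motion branch never sees embodiment tokens
        "embod": {"embod"},                     # embodiment branch never sees motion tokens
    }
    return [[k in allowed[q] for k in groups] for q in groups]


def masked_attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    dim = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        # Disallowed pairs get a score of -inf, so their softmax weight is 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        peak = max(scores)  # finite, because every token may attend to itself
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


# Two video tokens, one motion token, one embodiment token (toy values).
mask = branch_mask(2, 1, 1)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
mixed = masked_attention(tokens, tokens, tokens, mask)
```

Under this assumed mask, the motion token's output depends only on motion tokens and the embodiment token's output only on embodiment tokens, which is one way the interference described in the abstract could be limited. The paper's actual design may differ.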
Problem

Research questions and friction points this paper is trying to address.

cross-embodiment
video generation
motion transfer
embodiment adaptation
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-embodiment generation
unpaired adaptation
motion disentanglement
branch-isolated attention
embodied intelligence