EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer

📅 2025-12-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models struggle to model complex human motions due to the high degrees of freedom of human joints and pixel-level training objectives that ignore kinematic constraints, leading to motion distortions and temporal incoherence. To address this, we propose EchoMotion, a dual-modality diffusion Transformer framework featuring Motion-Video Synchronized RoPE (MVS-RoPE), a unified 3D positional encoding for video and motion tokens, and a two-stage cross-modal conditional diffusion training strategy. We also introduce HuMoVe, the first large-scale human-centric video–motion paired dataset (about 80K pairs). EchoMotion explicitly models the joint distribution of appearance and skeletal motion while incorporating human kinematic priors. Experiments demonstrate significant improvements over state-of-the-art methods in temporal coherence, motion plausibility, and bidirectional cross-modal generation (video ↔ motion). Critically, the generated motion sequences exhibit strong physical drivability, enabling robust downstream control in physics-based simulation.
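The dual-modality idea reduces to denoising video and motion latents as one concatenated token sequence, so cross-modal interaction happens inside ordinary self-attention. Below is a minimal PyTorch sketch of that idea; the module, hidden size, and token shapes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (illustrative names and sizes, not the paper's code) of
# processing video and motion latents as a single token sequence in a
# Transformer block, so each modality attends to the other at every layer.
import torch
import torch.nn as nn

class DualModalityBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention over the concatenated video+motion sequence.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

# Toy shapes: 16 video frames x 64 spatial tokens, plus 16 motion-frame tokens.
video_tokens = torch.randn(2, 16 * 64, 512)   # noisy video latents
motion_tokens = torch.randn(2, 16, 512)       # noisy motion (skeleton) latents
tokens = torch.cat([video_tokens, motion_tokens], dim=1)
out = DualModalityBlock()(tokens)              # (2, 16*64 + 16, 512)
```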

📝 Abstract
Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Synchronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy, which enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, and versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
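As a rough illustration of the synchronized-coordinate idea, the sketch below assigns video tokens 3D (t, h, w) positions while motion tokens reuse the same temporal index t, so the rotary phase along the time axis is identical for both modalities at the same frame. The coordinate convention for motion tokens and all function names are assumptions, not the paper's definition of MVS-RoPE.

```python
# Sketch of a synchronized 3D positional index in the spirit of MVS-RoPE:
# video tokens get (t, h, w) coordinates, motion tokens get (t, 0, 0), so the
# temporal rotary phase is shared across modalities. Assumed, illustrative code.
import torch

def rope_angles(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary phase angles for 1D integer positions `pos` over `dim` channels (dim even)."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return pos[:, None].float() * freqs[None, :]            # (N, dim // 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x (N, dim) by the given angles (N, dim // 2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def video_positions(T: int, H: int, W: int) -> torch.Tensor:
    t, h, w = torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    )
    return torch.stack([t, h, w], dim=-1).reshape(-1, 3)     # (T*H*W, 3)

def motion_positions(T: int) -> torch.Tensor:
    t = torch.arange(T)
    zero = torch.zeros_like(t)
    return torch.stack([t, zero, zero], dim=-1)              # (T, 3), shared time axis

# Video frame t and motion frame t receive identical temporal rotary phases,
# which is the temporal-alignment inductive bias described in the abstract.
T, H, W, d_axis = 8, 4, 4, 16
vid_pos, mot_pos = video_positions(T, H, W), motion_positions(T)
vid_t_angles = rope_angles(vid_pos[:, 0], d_axis)            # (T*H*W, d_axis // 2)
mot_t_angles = rope_angles(mot_pos[:, 0], d_axis)            # (T, d_axis // 2)
q_motion = apply_rope(torch.randn(T, d_axis), mot_t_angles)  # rotated motion queries
```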
Problem

Research questions and friction points this paper is trying to address.

Improves complex human action video generation quality
Models joint distribution of appearance and human motion
Enables cross-modal conditional generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch Diffusion Transformer processes multimodal tokens
MVS-RoPE provides unified 3D positional encoding for alignment
Two-stage training enables joint and cross-modal generation (see the sketch after this list)
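To make the cross-modal conditioning idea concrete, here is a hedged sketch of a training step that randomly switches between joint generation (both modalities noised) and conditional generation (one modality kept clean as the condition). The mode sampling, noise schedule, and loss target are assumptions and not the paper's actual two-stage recipe.

```python
# Illustrative sketch, not the paper's recipe: one training step that alternates
# between joint generation and cross-modal conditioning by keeping the
# conditioning modality clean and noising only the target modality.
import random
import torch

def add_noise(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    # Simple linear interpolation (flow-matching style) schedule, for illustration.
    t = t.view(-1, 1, 1)
    return (1.0 - t) * x0 + t * noise

def training_step(model, video0: torch.Tensor, motion0: torch.Tensor) -> torch.Tensor:
    """video0/motion0 are clean latents of shape (B, N, D); mode choice is assumed."""
    mode = random.choice(["joint", "video->motion", "motion->video"])
    t = torch.rand(video0.shape[0])
    v_noise, m_noise = torch.randn_like(video0), torch.randn_like(motion0)

    noise_video = mode in ("joint", "motion->video")   # video is a denoising target
    noise_motion = mode in ("joint", "video->motion")  # motion is a denoising target
    video_in = add_noise(video0, t, v_noise) if noise_video else video0
    motion_in = add_noise(motion0, t, m_noise) if noise_motion else motion0

    v_pred, m_pred = model(video_in, motion_in, t)
    loss = torch.zeros(())
    if noise_video:   # supervise only the noised modality/modalities
        loss = loss + torch.nn.functional.mse_loss(v_pred, v_noise - video0)
    if noise_motion:
        loss = loss + torch.nn.functional.mse_loss(m_pred, m_noise - motion0)
    return loss

# Tiny stand-in model, for illustration only.
dummy = lambda v, m, t: (torch.zeros_like(v), torch.zeros_like(m))
loss = training_step(dummy, torch.randn(2, 8, 16), torch.randn(2, 8, 16))
```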