3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches for human motion control in video generation typically rely on 2D poses, which bind motion to the driving viewpoint, or on explicit 3D models, which are susceptible to reconstruction errors. This work proposes an implicit, viewpoint-invariant motion representation: a motion encoder is jointly trained with a pretrained video generator to distill driving frames into compact motion tokens, which are then injected into the generation process via cross-attention. Multi-view video supervision enhances 3D consistency, while SMPL-assisted initialization combined with annealed geometric supervision enables a smooth transition from geometry-based guidance to data-driven refinement. Notably, the method operates without external 3D models, preserves high-fidelity motion, and supports text-driven camera control, significantly outperforming current state-of-the-art techniques.
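The cross-attention injection of motion tokens can be sketched roughly as below. This is a minimal single-head illustration, not the paper's implementation: the shapes, projection matrices (`Wq`, `Wk`, `Wv`), and residual form are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inject_motion(video_feats, motion_tokens, Wq, Wk, Wv):
    """Cross-attention: generator features query compact motion tokens.

    video_feats:   (N, D) hidden states of the video generator (queries)
    motion_tokens: (M, D) view-agnostic motion tokens (keys/values)
    Wq, Wk, Wv:    (D, D) hypothetical projection matrices
    """
    q, k, v = video_feats @ Wq, motion_tokens @ Wk, motion_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, M) attention weights
    return video_feats + attn @ v                    # residual injection

# usage: 16 generator tokens attend to 4 motion tokens, D = 32
rng = np.random.default_rng(0)
D = 32
out = inject_motion(rng.standard_normal((16, D)),
                    rng.standard_normal((4, D)),
                    rng.standard_normal((D, D)),
                    rng.standard_normal((D, D)),
                    rng.standard_normal((D, D)))
print(out.shape)  # (16, 32)
```

The residual form lets the generator fall back on its own spatial priors when the motion tokens carry little information, which matches the paper's goal of aligning with, rather than overriding, the generator's 3D awareness.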

📝 Abstract
Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
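The annealed geometric supervision described in the abstract can be illustrated with a simple loss-weight schedule. The linear ramp, step counts, and initial weight below are assumptions for illustration; the abstract only states that the SMPL-based term is used early and annealed to zero.

```python
def geo_loss_weight(step: int, anneal_steps: int, w0: float = 1.0) -> float:
    """Hypothetical linear anneal of the auxiliary SMPL geometric loss.

    Full weight w0 at step 0, decayed linearly to 0 by `anneal_steps`,
    after which training relies purely on the data and the generator's priors.
    """
    return max(0.0, w0 * (1.0 - step / anneal_steps))

# usage: total = task_loss + geo_loss_weight(step, 10_000) * smpl_geometric_loss
print(geo_loss_weight(0, 10_000))       # 1.0
print(geo_loss_weight(5_000, 10_000))   # 0.5
print(geo_loss_weight(20_000, 10_000))  # 0.0
```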
Problem

Research questions and friction points this paper is trying to address.

3D-aware
motion control
view-adaptive
human video generation
implicit representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D-aware motion control
implicit motion representation
view-agnostic video generation
cross-attention injection
geometric supervision annealing