MoCo: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-video generation of full-body human motion suffers from discontinuous motion, physically implausible dynamics, and structural distortions. Two causes stand out: existing models favor appearance fidelity over explicit motion modeling, and existing human video datasets mostly cover facial or upper-body motion or vertically oriented dance clips. This paper proposes MoCo, a structure-appearance decoupling framework: a 3D structure generator first produces a human motion sequence from the text prompt, and the video appearance is then synthesized under the guidance of that sequence. To improve fine-grained control over sparse human structures, the method introduces Human-Aware Dynamic Control modules and adds dense tracking constraints during training. The authors also construct a large-scale whole-body human video dataset with complex and diverse motions. Experiments show that MoCo outperforms existing approaches in generating realistic and structurally coherent human videos.

📝 Abstract
Generating human videos with consistent motion from text prompts remains a significant challenge, particularly for whole-body or long-range motion. Existing video generation models prioritize appearance fidelity, resulting in unrealistic or physically implausible human movements with poor structural coherence. Additionally, most existing human video datasets primarily focus on facial or upper-body motions, or consist of vertically oriented dance videos, limiting the scope of corresponding generation methods to simple movements. To overcome these challenges, we propose MoCo, which decouples the process of human video generation into two components: structure generation and appearance generation. Specifically, our method first employs an efficient 3D structure generator to produce a human motion sequence from a text prompt. The remaining video appearance is then synthesized under the guidance of the generated structural sequence. To improve fine-grained control over sparse human structures, we introduce Human-Aware Dynamic Control modules and integrate dense tracking constraints during training. Furthermore, recognizing the limitations of existing datasets, we construct a large-scale whole-body human video dataset featuring complex and diverse motions. Extensive experiments demonstrate that MoCo outperforms existing approaches in generating realistic and structurally coherent human videos.
Problem

Research questions and friction points this paper is trying to address.

Generating consistent human motion from text prompts
Overcoming unrealistic movements in video generation
Addressing limited whole-body motion datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure-appearance decoupling for human video generation
3D structure generator for motion sequences
Human-aware dynamic control with tracking constraints
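The two-stage decoupling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`StructureSequence`, `generate_structure`, `synthesize_appearance`, `generate_video`) is hypothetical, and both stages are toy placeholders standing in for the learned 3D structure generator and the appearance model. The only point it makes is the data flow: the text prompt first yields a structure sequence, and each video frame is then produced under the guidance of the matching structure frame.

```python
from dataclasses import dataclass
from typing import List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) position of one joint
Frame = List[List[float]]            # toy grayscale image, height x width


@dataclass
class StructureSequence:
    """Hypothetical container: one list of 3D joints per video frame."""
    frames: List[List[Joint]]


def generate_structure(prompt: str, num_frames: int = 8,
                       num_joints: int = 4) -> StructureSequence:
    """Stage 1 placeholder (assumption): map a prompt to a motion sequence.

    A real system would use a learned text-to-motion model; here the
    joints simply drift along z over time so the output is deterministic.
    """
    offset = float(sum(ord(c) for c in prompt) % 10)
    frames = [
        [(float(j), offset, 0.1 * t) for j in range(num_joints)]
        for t in range(num_frames)
    ]
    return StructureSequence(frames)


def synthesize_appearance(prompt: str, structure: StructureSequence,
                          height: int = 4, width: int = 4) -> List[Frame]:
    """Stage 2 placeholder (assumption): render one frame per structure frame.

    A real system would condition a video generator on the structure;
    here each frame is filled with the mean z value of its joints.
    """
    video = []
    for joints in structure.frames:
        mean_z = sum(z for _, _, z in joints) / len(joints)
        video.append([[mean_z] * width for _ in range(height)])
    return video


def generate_video(prompt: str, num_frames: int = 8) -> List[Frame]:
    """Decoupled pipeline: structure first, then structure-guided appearance."""
    structure = generate_structure(prompt, num_frames=num_frames)
    return synthesize_appearance(prompt, structure)


if __name__ == "__main__":
    clip = generate_video("a person doing a cartwheel", num_frames=6)
    print(len(clip), "frames of size", len(clip[0]), "x", len(clip[0][0]))
```

Keeping the structure sequence as an explicit intermediate value is what allows the motion to be inspected or constrained independently of how the frames look.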
Authors
Haoyu Wang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Hao Tang
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Donglin Di
Li Auto Inc.
Generative Models, Embodied AI, Medical Image, Multimedia
Zhilu Zhang
Harbin Institute of Technology
Low-Level Vision, Computational Photography, 3D Reconstruction and Generation
Wangmeng Zuo
School of Computer Science and Technology, Harbin Institute of Technology
Computer Vision, Image Processing, Generative AI, Deep Learning, Biometrics
Feng Gao
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Siwei Ma
State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
Shiliang Zhang
Department of Computer Science, School of EECS, Peking University
Multimedia Information Retrieval, Multimedia Systems, Visual Search