CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos

📅 2026-01-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing approaches typically treat 3D human motion generation and 2D video synthesis as separate tasks, often failing to ensure structural consistency and semantic coherence between them. This work proposes CoMoVi, a unified framework that jointly models both tasks within a single diffusion denoising process. CoMoVi employs a dual-branch architecture coupled with a 3D–2D cross-attention mechanism to enable effective feature interaction. We introduce a 2D human motion representation that inherits priors from pretrained video models and construct a large-scale multimodal CoMoVi dataset annotated with both text and motion labels. Experimental results demonstrate significant improvements in both 3D motion generation and photorealistic video synthesis, validating the effectiveness and generalization capability of the proposed joint generation paradigm.

šŸ“ Abstract
In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled: 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which makes it natural to couple their generation processes. Based on this observation, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that inherits the powerful prior of pre-trained VDMs. We then design a dual-branch diffusion model that couples the human motion and video generation processes through mutual feature interaction and 3D–2D cross-attention. Moreover, we curate the CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method on both 3D human motion and video generation tasks.
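To make the 3D–2D cross-attention idea concrete, the sketch below shows a minimal single-head cross-attention step in which tokens from one branch (e.g. 3D motion tokens) attend to tokens from the other (e.g. video patch tokens). This is an illustrative toy, not the paper's actual architecture: the token counts, feature width `d`, and the random projection matrices `Wq`/`Wk`/`Wv` are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # shared feature width (illustrative choice)

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(x_q, x_kv, Wq, Wk, Wv):
    """x_q: (Nq, d) tokens from one branch; x_kv: (Nkv, d) tokens from
    the other branch. Returns (Nq, d) features fused across branches."""
    Q, K, V = x_q @ Wq, x_kv @ Wk, x_kv @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))  # (Nq, Nkv) attention weights
    return A @ V

# Toy tokens: 24 motion tokens (e.g. one per joint) and 16x16 video patches.
motion = rng.standard_normal((24, d))
video = rng.standard_normal((256, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

# Motion branch queries video features (3D attends to 2D); the symmetric
# call cross_attend(video, motion, ...) would let the video branch query
# motion features, giving the mutual interaction the abstract describes.
fused_motion = cross_attend(motion, video, Wq, Wk, Wv)
print(fused_motion.shape)  # (24, 64)
```

In a dual-branch diffusion model, such a fused representation would feed back into each branch's denoising layers, so the two generation processes constrain each other at every denoising step.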
Problem

Research questions and friction points this paper is trying to address.

3D human motion generation
video generation
motion-video coupling
human motion representation
diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

co-generative framework
3D human motion
video diffusion models
cross-attention
motion-video coupling