ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods struggle to achieve high appearance fidelity, natural motion, and controllable camera viewpoints at the same time in human video synthesis when multi-view data are limited. This work proposes an “image-first” paradigm: a pretrained image generation model supplies a high-quality human appearance prior and is combined with SMPL-X-based motion guidance, and a training-free temporal refinement stage built on a pretrained video diffusion model then improves consistency across frames. Decoupling appearance modeling from temporal consistency in this way yields pose- and viewpoint-controllable, high-fidelity video synthesis. The authors also release a canonical human dataset and an auxiliary model for compositional human image synthesis to support future research.

📝 Abstract

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
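
The decoupling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `Condition`, `image_first_pipeline`, `make_toy_image_model`, `toy_temporal_refiner`, and `flicker` are hypothetical names, frames are flat lists of numbers rather than images, and a three-frame moving average stands in for the training-free refinement that the paper performs with a pretrained video diffusion model. The only point is the structure: frames are produced one at a time from pose and camera conditions, and temporal consistency is handled in a separate stage.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for an image: a flat list of pixel values


@dataclass
class Condition:
    pose: Sequence[float]    # placeholder for SMPL-X pose parameters
    camera: Sequence[float]  # placeholder for camera viewpoint parameters


def image_first_pipeline(
    conditions: Sequence[Condition],
    image_model: Callable[[Condition], Frame],
    temporal_refiner: Callable[[List[Frame]], List[Frame]],
) -> List[Frame]:
    # Stage 1: appearance. Each frame is generated independently.
    frames = [image_model(c) for c in conditions]
    # Stage 2: temporal consistency. Refinement sees the whole sequence.
    return temporal_refiner(frames)


def make_toy_image_model(seed: int = 0, pixels: int = 4) -> Callable[[Condition], Frame]:
    # Assumption: per-frame generation is modeled as signal plus random jitter.
    rng = random.Random(seed)

    def model(cond: Condition) -> Frame:
        base = sum(cond.pose) + sum(cond.camera)
        return [base + rng.uniform(-0.2, 0.2) for _ in range(pixels)]

    return model


def toy_temporal_refiner(frames: List[Frame]) -> List[Frame]:
    # Assumption: a 3-frame moving average (edges repeated) replaces the
    # video-diffusion-based refinement used in the paper.
    last = len(frames) - 1
    refined = []
    for t in range(len(frames)):
        prev, nxt = frames[max(0, t - 1)], frames[min(last, t + 1)]
        refined.append([(a + b + c) / 3.0 for a, b, c in zip(prev, frames[t], nxt)])
    return refined


def flicker(frames: List[Frame]) -> float:
    # Sum of absolute frame-to-frame pixel differences.
    return sum(abs(b - a) for f0, f1 in zip(frames, frames[1:]) for a, b in zip(f0, f1))


conditions = [Condition(pose=[0.1 * t], camera=[0.0]) for t in range(8)]
raw_model = make_toy_image_model(seed=0)
raw_frames = [raw_model(c) for c in conditions]
video = image_first_pipeline(conditions, make_toy_image_model(seed=0), toy_temporal_refiner)
```

Comparing `flicker(raw_frames)` with `flicker(video)` shows the refinement stage reducing frame-to-frame variation without touching the image model, which is the sense in which the two concerns are decoupled.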

Problem

Research questions and friction points this paper is trying to address.

human video generation

appearance modeling

motion modeling

camera viewpoint

temporal consistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

image-first synthesis

controllable human video generation

SMPL-X motion guidance