AI Summary
Existing methods struggle to simultaneously achieve high appearance fidelity, natural motion dynamics, and controllable camera viewpoints in human video synthesis when multi-view data are limited. This work proposes an “image-first” generation paradigm: it first leverages a pre-trained image generation model to learn a high-quality human appearance prior, then integrates SMPL-X pose conditioning with a pre-trained video diffusion model. Through a training-free temporal optimization strategy, the approach enables pose- and viewpoint-controllable, high-fidelity video synthesis. By effectively decoupling appearance modeling from temporal consistency, the method significantly enhances both visual quality and controllability. The authors also release a standardized human dataset and the corresponding synthesis model to support future research.
Abstract
Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.