🤖 AI Summary
Existing video diffusion models struggle to accurately capture the dynamics and physical properties of human motion, often producing videos that lack temporal coherence and fine-grained realism. To address this limitation, this work proposes HumANDiff, a framework that enhances motion modeling without altering the underlying diffusion architecture. By incorporating articulated noise sampling based on a 3D human template, a joint learning mechanism for appearance and motion, and a geometric motion consistency loss defined in the noise space, HumANDiff achieves spatiotemporally coherent motion generation with intrinsic motion control. The method supports image-to-video synthesis from a single input frame and generates high-fidelity, naturally moving human videos across diverse clothing styles, significantly outperforming current state-of-the-art approaches.
📝 Abstract
Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to faithfully capture the dynamics and physics of human motion. In this paper, we propose HumANDiff, a new framework for human video generation that enhances human motion control with three key designs: 1) Articulated motion-consistent noise sampling, which correlates the spatiotemporal distribution of latent noise by replacing unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template; the noise thereby inherits body topology priors for spatially and temporally consistent sampling. 2) Joint appearance-motion learning, which extends the standard training objective of video diffusion models by jointly predicting pixel appearances and the corresponding physical motions from the articulated noise, enabling high-fidelity synthesis of effects such as motion-dependent clothing wrinkles. 3) Geometric motion consistency learning, which enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable, controllable human video generation by fine-tuning video diffusion models with articulated noise sampling; the method is therefore agnostic to diffusion model design and requires no architectural modifications. During inference, HumANDiff performs image-to-video generation within a single framework, achieving intrinsic motion control without additional motion modules. Extensive experiments demonstrate state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Project page: https://taohuumd.github.io/projects/HumANDiff/
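To make the core idea of articulated noise sampling concrete, the sketch below illustrates one plausible reading of it: a Gaussian noise value is drawn once per vertex of a body template, and in every frame that value is splatted into the latent grid at the vertex's posed 2D location, so the same surface point carries the same noise across frames while the background stays i.i.d. Gaussian. This is a minimal illustration, not the paper's implementation; the function name `articulated_noise` and the orthographic projection are hypothetical, and the actual method operates on the dense surface manifold of a statistical body template with a proper rasterizer.

```python
import numpy as np

def articulated_noise(template_vertices, posed_vertices_per_frame, H, W, seed=0):
    """Sketch of body-anchored noise sampling (assumed simplification).

    template_vertices: (V, 3) canonical template, used only for the vertex count.
    posed_vertices_per_frame: list of (V, 3) posed vertices, one entry per frame,
        with x/y assumed in [-1, 1] (hypothetical normalization).
    Returns an (F, H, W) stack of latent noise maps.
    """
    rng = np.random.default_rng(seed)
    V = template_vertices.shape[0]
    vertex_noise = rng.standard_normal(V)  # one sample per surface point, fixed over time
    frames = []
    for verts in posed_vertices_per_frame:
        latent = rng.standard_normal((H, W))  # background remains unstructured Gaussian
        # Orthographic projection of posed vertices into the latent grid (an assumption;
        # the paper does not specify this mapping).
        u = np.clip(((verts[:, 0] + 1) / 2 * (W - 1)).astype(int), 0, W - 1)
        v = np.clip(((verts[:, 1] + 1) / 2 * (H - 1)).astype(int), 0, H - 1)
        latent[v, u] = vertex_noise  # body pixels inherit the vertex-anchored noise
        frames.append(latent)
    return np.stack(frames)
```

Because the per-vertex noise is sampled once and only its image-space location changes with the pose, the resulting latents are temporally correlated along the body surface, which is the property the articulated sampling is designed to provide.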