🤖 AI Summary
Existing video diffusion models struggle to accurately capture the dynamics and physical properties of human motion, often producing videos that lack temporal coherence and fine-grained realism. To address this limitation, this work proposes HumANDiff, a framework that enhances motion modeling without altering the underlying diffusion architecture. By incorporating articulated noise sampling based on a 3D human template, a joint learning mechanism for appearance and motion, and a geometric motion consistency loss defined in the noise space, HumANDiff achieves spatiotemporally coherent motion generation with intrinsic motion control. The method supports image-to-video synthesis from a single input frame and generates high-fidelity, naturally moving human videos across diverse clothing styles, significantly outperforming current state-of-the-art approaches.
📝 Abstract
Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to faithfully capture the dynamics and physics of human motion. In this paper, we propose HumANDiff, a new framework for human video generation that enhances human motion control with three key designs: 1) Articulated motion-consistent noise sampling, which correlates the spatiotemporal distribution of latent noise by replacing unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template; the noise thereby inherits body topology priors for spatially and temporally consistent sampling. 2) Joint appearance-motion learning, which extends the standard training objective of video diffusion models by jointly predicting pixel appearances and the corresponding physical motions from the articulated noise, enabling high-fidelity synthesis of effects such as motion-dependent clothing wrinkles. 3) Geometric motion consistency learning, which enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable, controllable human video generation by fine-tuning video diffusion models with articulated noise sampling; the method is therefore agnostic to diffusion model design and requires no architectural modifications. During inference, HumANDiff performs image-to-video generation within a single framework, achieving intrinsic motion control without additional motion modules. Extensive experiments demonstrate state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Project page: https://taohuumd.github.io/projects/HumANDiff/
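To make the core idea of articulated noise sampling concrete, the sketch below illustrates one plausible reading of it: a Gaussian noise value is drawn once per vertex of a body template, and in every frame that value is splatted into the latent grid at the vertex's posed 2D location, so the same surface point carries the same noise across frames while the background stays i.i.d. Gaussian. This is a minimal illustration, not the paper's implementation; the function name `articulated_noise` and the orthographic projection are hypothetical, and the actual method operates on the dense surface manifold of a statistical body template with a proper rasterizer.

```python
import numpy as np

def articulated_noise(template_vertices, posed_vertices_per_frame, H, W, seed=0):
    """Sketch of body-anchored noise sampling (assumed simplification).

    template_vertices: (V, 3) canonical template, used only for the vertex count.
    posed_vertices_per_frame: list of (V, 3) posed vertices, one entry per frame,
        with x/y assumed in [-1, 1] (hypothetical normalization).
    Returns an (F, H, W) stack of latent noise maps.
    """
    rng = np.random.default_rng(seed)
    V = template_vertices.shape[0]
    vertex_noise = rng.standard_normal(V)  # one sample per surface point, fixed over time
    frames = []
    for verts in posed_vertices_per_frame:
        latent = rng.standard_normal((H, W))  # background remains unstructured Gaussian
        # Orthographic projection of posed vertices into the latent grid (an assumption;
        # the paper does not specify this mapping).
        u = np.clip(((verts[:, 0] + 1) / 2 * (W - 1)).astype(int), 0, W - 1)
        v = np.clip(((verts[:, 1] + 1) / 2 * (H - 1)).astype(int), 0, H - 1)
        latent[v, u] = vertex_noise  # body pixels inherit the vertex-anchored noise
        frames.append(latent)
    return np.stack(frames)
```

Because the per-vertex noise is sampled once and only its image-space location changes with the pose, the resulting latents are temporally correlated along the body surface, which is the property the articulated sampling is designed to provide.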