🤖 AI Summary
Current video diffusion models struggle to generate human animations in open-domain dynamic backgrounds because they rely on coarse-grained pose conditioning (e.g., DWPose or SMPL-X), resulting in poor character-background alignment and inaccurate motion following. To address this, the authors propose AniCrafter, a human-centric animation model that introduces a novel "avatar-background" joint conditioning mechanism, reframing open-domain human-centric animation as a restoration task. Built upon an Image-to-Video (I2V) diffusion architecture, AniCrafter integrates multimodal motion representations (DWPose and SMPL-X) with a dual-path conditional encoder, significantly improving temporal coherence and cross-scene generalization under dynamic backgrounds. Extensive evaluation across diverse scenarios demonstrates that AniCrafter produces high-fidelity, motion-accurate human animations, outperforming existing state-of-the-art methods. Code will be released publicly.
📝 Abstract
Recent advances in video diffusion models have significantly improved character animation techniques. However, current approaches rely on basic structural conditions such as DWPose or SMPL-X to animate character images, limiting their effectiveness in open-domain scenarios with dynamic backgrounds or challenging human poses. In this paper, we introduce $\textbf{AniCrafter}$, a diffusion-based human-centric animation model that can seamlessly integrate and animate a given character into open-domain dynamic backgrounds while following given human motion sequences. Built on cutting-edge Image-to-Video (I2V) diffusion architectures, our model incorporates an innovative "avatar-background" conditioning mechanism that reframes open-domain human-centric animation as a restoration task, enabling more stable and versatile animation outputs. Experimental results demonstrate the superior performance of our method. Code will be available at https://github.com/MyNiuuu/AniCrafter.
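To make the "avatar-background" restoration framing concrete, here is a minimal, hypothetical NumPy sketch of the conditioning signal it implies: an animated avatar render is composited over the dynamic background video to form a coarse video, which the I2V diffusion model would then refine into the final animation. All shapes and the `make_condition_video` helper are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def make_condition_video(avatar_rgba, background_rgb):
    """Alpha-composite rendered avatar frames over background frames.

    avatar_rgba: (T, H, W, 4) float32 in [0, 1], alpha in the last channel
    background_rgb: (T, H, W, 3) float32 in [0, 1]
    returns: (T, H, W, 3) coarse conditioning video (avatar pasted on background)
    """
    rgb = avatar_rgba[..., :3]
    alpha = avatar_rgba[..., 3:4]  # broadcast over the 3 color channels
    return alpha * rgb + (1.0 - alpha) * background_rgb

# Toy example: 2 frames of a 4x4 video.
T, H, W = 2, 4, 4
avatar = np.zeros((T, H, W, 4), dtype=np.float32)
avatar[..., 0] = 1.0          # a red avatar render
avatar[:, 1:3, 1:3, 3] = 1.0  # opaque only inside a 2x2 patch
background = np.full((T, H, W, 3), 0.5, dtype=np.float32)  # gray background

cond = make_condition_video(avatar, background)
print(cond.shape)  # (2, 4, 4, 3)
```

Under this framing, the diffusion model's job is "restoration": turning the crudely composited video into a photorealistic one with correct lighting and boundaries, rather than generating the scene from pose skeletons alone.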