StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation

📅 2025-07-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address identity (ID) consistency degradation in human image animation caused by pose discrepancies, this paper proposes the first learnable pose-alignment ID-preserving video diffusion framework. Methodologically, it introduces a novel SVD-guided learnable similarity transformation alignment layer and a distribution-aware ID adapter, integrated with a global content-aware facial encoder and a Hamilton–Jacobi–Bellman equation-based facial optimization module—all trained end-to-end without post-processing. The framework effectively mitigates pose misalignment and facial distortion. Quantitative and qualitative evaluations across multiple benchmarks demonstrate state-of-the-art performance in ID fidelity and facial quality. This work establishes a new paradigm for diffusion-based human animation that jointly achieves robustness and high fidelity.

Technology Category

Application Category

📝 Abstract
Current diffusion models for human image animation often struggle to maintain identity (ID) consistency, especially when the reference image and driving video differ significantly in body size or position. We introduce StableAnimator++, the first ID-preserving video diffusion framework with learnable pose alignment, capable of generating high-quality videos conditioned on a reference image and a pose sequence without any post-processing. Building upon a video diffusion model, StableAnimator++ contains carefully designed modules for both training and inference, striving for identity consistency. In particular, StableAnimator++ first uses learnable layers to predict the similarity transformation matrices between the reference image and the driven poses via injecting guidance from Singular Value Decomposition (SVD). These matrices align the driven poses with the reference image, mitigating misalignment to a great extent. StableAnimator++ then computes image and face embeddings using off-the-shelf encoders, refining the face embeddings via a global content-aware Face Encoder. To further maintain ID, we introduce a distribution-aware ID Adapter that counteracts interference caused by temporal layers while preserving ID via distribution alignment. During the inference stage, we propose a novel Hamilton-Jacobi-Bellman (HJB) based face optimization integrated into the denoising process, guiding the diffusion trajectory for enhanced facial fidelity. Experiments on benchmarks show the effectiveness of StableAnimator++ both qualitatively and quantitatively.
Problem

Research questions and friction points this paper is trying to address.

Overcoming pose misalignment in human image animation
Reducing face distortion for identity consistency
Enhancing facial fidelity without post-processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learnable pose alignment via SVD guidance
Global content-aware Face Encoder for refinement
HJB-based face optimization during denoising
🔎 Similar Papers
No similar papers found.