Improving Human Image Animation via Semantic Representation Alignment

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the body distortion and facial artifacts that commonly arise in human image animation when generating long videos or depicting vigorous motion. The authors propose SemanticREPA, which uses human structural and identity (ID) semantic representations as supervision signals, rather than as conditional inputs, within a diffusion model framework. A structure alignment module aligns structure representations derived from video latents with video depth estimation features, and an ID alignment module aligns the ID representations of generated videos with face recognition features. The predicted structure representations are further used to refine identity restoration in the relevant regions. The authors report higher quality on extended character motions and improved character consistency.

📝 Abstract

The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations can reduce generation flexibility. Moreover, their reliance on RGB pixel supervision places little emphasis on learning the necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then freeze the pretrained module and use it to provide additional supervision on the structure representations of the diffusion model, achieving structure rectification that yields coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos with face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.
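
The abstract does not state the exact form of the alignment objective. As a rough illustration only, the sketch below assumes a cosine-similarity alignment term, which is one common choice for representation alignment, added to the usual diffusion loss. The function names, the weights `lambda_structure` and `lambda_id`, and the plain-list feature format are all hypothetical and not taken from the paper.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def alignment_loss(predicted: Sequence[Vector], target: Sequence[Vector]) -> float:
    """Mean (1 - cosine similarity) over paired feature vectors.

    `predicted` stands in for representations read out of the diffusion model;
    `target` stands in for features from a frozen encoder (assumed here to be
    a depth estimator or a face recognition network).
    """
    if not predicted or len(predicted) != len(target):
        raise ValueError("predicted and target must be non-empty and the same length")
    total = sum(1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)


def total_loss(
    diffusion_loss: float,
    structure_pred: Sequence[Vector],
    depth_feats: Sequence[Vector],
    id_pred: Sequence[Vector],
    face_feats: Sequence[Vector],
    lambda_structure: float = 0.5,  # hypothetical weight, not from the paper
    lambda_id: float = 0.5,         # hypothetical weight, not from the paper
) -> float:
    """Diffusion loss plus weighted structure and ID alignment terms."""
    return (
        diffusion_loss
        + lambda_structure * alignment_loss(structure_pred, depth_feats)
        + lambda_id * alignment_loss(id_pred, face_feats)
    )
```

In this sketch the semantic features appear only inside the loss, so they would be needed during training but not at inference, which is the sense in which they act as supervision signals rather than conditions.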

Problem

Research questions and friction points this paper is trying to address.

human image animation

semantic representation

3D geometric relationships

temporal coherence

facial distortion

Innovation

Methods, ideas, or system contributions that make the work stand out.

representation alignment

human image animation

structure consistency