UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in applying large-scale video diffusion Transformers (DiTs) to human image animation, namely temporal inconsistency, high training overhead, and poor cross-resolution generalization. To this end, we propose an efficient, high-fidelity animation framework built on the Wan2.1 architecture. We introduce a lightweight 3D-convolutional pose encoder for precise driving-pose modeling; adopt LoRA for parameter-efficient fine-tuning, substantially reducing GPU memory consumption while preserving the original DiT's strong generative capability; and design a joint appearance-pose conditioning mechanism that enables lossless inference scaling from 480p training to 720p output. Experiments demonstrate temporally coherent motion, natural visual quality, and consistently high fidelity across resolutions, and the training and inference code is publicly released.
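The appearance conditioning described in the summary is a plain concatenation: the encoded reference image is joined with the video latents before entering the DiT. A schematic NumPy sketch (tensor layout, channel count, and sizes are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, H, W = 4, 8, 16, 16  # latent channels, frames, spatial size (illustrative)

video_latent = rng.standard_normal((C, T, H, W))  # noisy video latents
ref_latent = rng.standard_normal((C, 1, H, W))    # encoded reference image, one extra "frame"

# Concatenate along the temporal axis so the DiT attends to the reference
# appearance the same way it attends to ordinary video frames.
conditioned = np.concatenate([ref_latent, video_latent], axis=1)
print(conditioned.shape)  # (4, 9, 16, 16)
```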

📝 Abstract
This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we implement the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720p (1280x720) during inference. The training and inference code is publicly available at https://github.com/ali-vilab/UniAnimate-DiT.
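The pose encoder the abstract describes, a small stack of 3D convolutional layers over the driving-pose sequence, can be sketched in miniature as follows. This is a minimal NumPy illustration of the idea, not the authors' implementation; layer count, channel widths, kernel sizes, and strides are all assumptions:

```python
import numpy as np

def conv3d(x, w, stride=1):
    """Naive valid 3D convolution.
    x: (C_in, T, H, W); w: (C_out, C_in, kT, kH, kW)."""
    C_out, C_in, kT, kH, kW = w.shape
    _, T, H, W = x.shape
    oT = (T - kT) // stride + 1
    oH = (H - kH) // stride + 1
    oW = (W - kW) // stride + 1
    out = np.zeros((C_out, oT, oH, oW))
    for t in range(oT):
        for i in range(oH):
            for j in range(oW):
                patch = x[:, t*stride:t*stride+kT,
                          i*stride:i*stride+kH,
                          j*stride:j*stride+kW]
                # Contract each filter against the local spatio-temporal patch.
                out[:, t, i, j] = np.tensordot(w, patch,
                                               axes=([1, 2, 3, 4], [0, 1, 2, 3]))
    return out

rng = np.random.default_rng(0)
# Toy driving-pose clip: 3-channel pose maps, 9 frames, 16x16 (sizes illustrative).
pose = rng.standard_normal((3, 9, 16, 16))
# Two stacked 3D conv layers with ReLU; the first downsamples via stride 2.
w1 = 0.1 * rng.standard_normal((8, 3, 3, 3, 3))
w2 = 0.1 * rng.standard_normal((16, 8, 3, 3, 3))
h = np.maximum(conv3d(pose, w1, stride=2), 0.0)
feat = np.maximum(conv3d(h, w2, stride=1), 0.0)
print(feat.shape)  # (16, 2, 5, 5)
```

Because the kernels span the time axis, each output feature mixes information across neighboring frames, which is what lets such an encoder capture motion rather than per-frame pose alone.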
Problem

Research questions and friction points this paper is trying to address.

Human image animation using video diffusion
Preserve generative capabilities with LoRA tuning
Enhance pose alignment with lightweight encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LoRA for efficient parameter fine-tuning
Employs 3D convolutional layers for motion encoding
Concatenates reference appearance and pose information
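The LoRA idea behind the first bullet, freezing the pretrained weights and training only a low-rank update, reduces to y = Wx + (alpha/r)·BAx per adapted layer. A small NumPy sketch (rank, scaling, and layer shape are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16  # rank r << d; values are illustrative

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = 0.01 * rng.standard_normal((r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, init to zero

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B A x; only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted layer reproduces the base model exactly,
# so fine-tuning starts from the pretrained behavior.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)

# Trainable parameters per layer: r*(d_in + d_out) vs d_in*d_out for full tuning.
print(r * (d_in + d_out), "vs", d_in * d_out)  # 1024 vs 4096
```

This parameter reduction is the source of the lower GPU memory footprint the summary mentions: optimizer state is kept only for A and B, not for the full DiT weights.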
Xiang Wang
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Shiwei Zhang
Alibaba Group
Longxiang Tang
Tsinghua University
Yingya Zhang
Alibaba Group
Changxin Gao
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Yuehuan Wang
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Nong Sang
Huazhong University of Science and Technology