UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer

📅 2025-04-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in applying large-scale video diffusion Transformers (DiTs) to human image animation, namely temporal inconsistency, high training overhead, and poor cross-resolution generalization. To this end, we propose an efficient, high-fidelity animation framework built on the Wan2.1 architecture. We introduce a lightweight 3D-convolutional pose encoder for precise driving-pose modeling; adopt LoRA for parameter-efficient fine-tuning, substantially reducing GPU memory consumption while preserving the original DiT's strong generative capability; and design a joint appearance-pose conditioning mechanism that enables lossless inference scaling from 480p training to 720p output. Experiments demonstrate temporally coherent motion, natural visual quality, and consistently high fidelity across resolutions, and the training and inference code is publicly released.
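The appearance conditioning described in the summary is a plain concatenation: the encoded reference image is joined with the video latents before entering the DiT. A schematic NumPy sketch (tensor layout, channel count, and sizes are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, H, W = 4, 8, 16, 16  # latent channels, frames, spatial size (illustrative)

video_latent = rng.standard_normal((C, T, H, W))  # noisy video latents
ref_latent = rng.standard_normal((C, 1, H, W))    # encoded reference image, one extra "frame"

# Concatenate along the temporal axis so the DiT attends to the reference
# appearance the same way it attends to ordinary video frames.
conditioned = np.concatenate([ref_latent, video_latent], axis=1)
print(conditioned.shape)  # (4, 9, 16, 16)
```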

📝 Abstract
This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we implement the Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appealing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720p (1280x720) during inference. The training and inference code is publicly available at https://github.com/ali-vilab/UniAnimate-DiT.
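The pose encoder the abstract describes, a small stack of 3D convolutional layers over the driving-pose sequence, can be sketched in miniature as follows. This is a minimal NumPy illustration of the idea, not the authors' implementation; layer count, channel widths, kernel sizes, and strides are all assumptions:

```python
import numpy as np

def conv3d(x, w, stride=1):
    """Naive valid 3D convolution.
    x: (C_in, T, H, W); w: (C_out, C_in, kT, kH, kW)."""
    C_out, C_in, kT, kH, kW = w.shape
    _, T, H, W = x.shape
    oT = (T - kT) // stride + 1
    oH = (H - kH) // stride + 1
    oW = (W - kW) // stride + 1
    out = np.zeros((C_out, oT, oH, oW))
    for t in range(oT):
        for i in range(oH):
            for j in range(oW):
                patch = x[:, t*stride:t*stride+kT,
                          i*stride:i*stride+kH,
                          j*stride:j*stride+kW]
                # Contract each filter against the local spatio-temporal patch.
                out[:, t, i, j] = np.tensordot(w, patch,
                                               axes=([1, 2, 3, 4], [0, 1, 2, 3]))
    return out

rng = np.random.default_rng(0)
# Toy driving-pose clip: 3-channel pose maps, 9 frames, 16x16 (sizes illustrative).
pose = rng.standard_normal((3, 9, 16, 16))
# Two stacked 3D conv layers with ReLU; the first downsamples via stride 2.
w1 = 0.1 * rng.standard_normal((8, 3, 3, 3, 3))
w2 = 0.1 * rng.standard_normal((16, 8, 3, 3, 3))
h = np.maximum(conv3d(pose, w1, stride=2), 0.0)
feat = np.maximum(conv3d(h, w2, stride=1), 0.0)
print(feat.shape)  # (16, 2, 5, 5)
```

Because the kernels span the time axis, each output feature mixes information across neighboring frames, which is what lets such an encoder capture motion rather than per-frame pose alone.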
Problem

Research questions and friction points this paper is trying to address.

Human image animation using video diffusion
Preserve generative capabilities with LoRA tuning
Enhance pose alignment with lightweight encoder
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LoRA for efficient parameter fine-tuning
Employs 3D convolutional layers for motion encoding
Concatenates reference appearance and pose information
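The LoRA idea behind the first bullet, freezing the pretrained weights and training only a low-rank update, reduces to y = Wx + (alpha/r)·BAx per adapted layer. A small NumPy sketch (rank, scaling, and layer shape are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16  # rank r << d; values are illustrative

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = 0.01 * rng.standard_normal((r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, init to zero

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B A x; only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted layer reproduces the base model exactly,
# so fine-tuning starts from the pretrained behavior.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x)

# Trainable parameters per layer: r*(d_in + d_out) vs d_in*d_out for full tuning.
print(r * (d_in + d_out), "vs", d_in * d_out)  # 1024 vs 4096
```

This parameter reduction is the source of the lower GPU memory footprint the summary mentions: optimizer state is kept only for A and B, not for the full DiT weights.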
Xiang Wang
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Shiwei Zhang
Alibaba Group
Longxiang Tang
Tsinghua University
Yingya Zhang
Alibaba Group
Changxin Gao
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Yuehuan Wang
Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology
Nong Sang
Huazhong University of Science and Technology