RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild

📅 2025-04-21
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Controllable character animation faces generalization bottlenecks in open-world scenarios, particularly under rare poses, stylized characters, character-object interactions, complex lighting, and dynamic backgrounds. To address this, we propose RealisDance-DiT, a lightweight controllable animation framework built upon the Wan-2.1 video diffusion Transformer (DiT). Our approach eliminates the redundant Reference Net and introduces only minimal architectural modifications to the foundation model. We further introduce a low-noise warmup mechanism and a "large batches and small iterations" fine-tuning strategy that balance convergence efficiency with preservation of the base model's priors. Evaluated on a newly curated real-world test set and established benchmarks, including the TikTok and UBC Fashion video datasets, our method significantly outperforms prior art, establishing a new strong baseline for controllable character animation in the wild.
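The page text does not spell out what the "minimal architectural modifications" are. One common minimal-modification scheme for conditioning a pretrained DiT, sketched below under that assumption, is to project the pose condition into the latent space with a zero-initialized layer and add it to the noisy input latents, so the wrapped model initially behaves exactly like the pretrained one. All names here (PoseConditionedDiT, pose_proj, the additive injection itself) are hypothetical illustrations, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PoseConditionedDiT(nn.Module):
    """Wraps a pretrained video DiT so pose guidance enters through the
    input latents instead of a separate Reference Net branch."""

    def __init__(self, dit: nn.Module, latent_channels: int, pose_channels: int):
        super().__init__()
        self.dit = dit
        # Zero-initialized 1x1x1 projection: at initialization the wrapper
        # adds nothing, preserving the foundation model's priors.
        self.pose_proj = nn.Conv3d(pose_channels, latent_channels, kernel_size=1)
        nn.init.zeros_(self.pose_proj.weight)
        nn.init.zeros_(self.pose_proj.bias)

    def forward(self, noisy_latents, pose_latents, timesteps):
        # Additive condition injection; no bypass network is needed.
        x = noisy_latents + self.pose_proj(pose_latents)
        return self.dit(x, timesteps)
```

Zero initialization (as popularized by ControlNet) is what makes such a modification "minimal" in practice: fine-tuning starts from the unmodified foundation model and only gradually learns to use the pose signal.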

📝 Abstract
Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but often struggles to generalize to open-world scenarios. In this paper, we propose a new perspective: as long as the foundation model is powerful enough, straightforward model modifications with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as the TikTok dataset and the UBC fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.
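The abstract names the low-noise warmup strategy without describing its mechanics. A plausible reading, sketched below, is to restrict sampled diffusion timesteps to the low-noise end of the schedule early in fine-tuning and widen the range linearly over training; the function name, the linear schedule, and the warmup_steps/min_frac defaults are assumptions for illustration, not details from the paper.

```python
import torch

def sample_warmup_timesteps(batch_size: int, step: int,
                            num_train_timesteps: int = 1000,
                            warmup_steps: int = 2000,
                            min_frac: float = 0.05) -> torch.Tensor:
    """Sample diffusion timesteps with a low-noise warmup.

    Uses the DDPM convention that t = 0 is clean data and larger t means
    more noise. The admissible range [0, t_max) grows linearly with the
    fine-tuning step until the full schedule is covered.
    """
    frac = min(1.0, max(step / warmup_steps, min_frac))
    t_max = max(1, int(frac * num_train_timesteps))
    return torch.randint(0, t_max, (batch_size,))
```

A training loop would call sample_warmup_timesteps(batch_size, step) in place of a uniform sampler; the intuition is that low-noise timesteps perturb the pretrained denoiser least, easing the model into the new conditioning before exposing it to the full noise range.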
Problem

Research questions and friction points this paper is trying to address.

Addressing controllable character animation challenges in diverse real-world scenarios
Improving generalization for rare poses and complex scenes with minimal model changes
Enhancing fine-tuning strategies to preserve foundation model priors effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimal modifications to foundation model architecture
Low-noise warmup and "large batches and small iterations" fine-tuning strategies (see the sketch after this list)
Built upon Wan-2.1 video foundation model
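The "large batches and small iterations" strategy is stated but not implemented anywhere in this page's text. Gradient accumulation is one standard way to realize a large effective batch on fixed hardware while capping the number of optimizer updates; the sketch below is illustrative only, and the function name, the accum_steps and max_updates values, and the assumption that model(**batch) returns a scalar loss are all hypothetical.

```python
import torch

def finetune_large_batch(model, dataloader, optimizer,
                         accum_steps: int = 16, max_updates: int = 1000) -> int:
    """Emulate a large effective batch via gradient accumulation while
    limiting the total number of optimizer updates, so fine-tuning
    converges without drifting far from the foundation model's priors."""
    model.train()
    updates = 0
    optimizer.zero_grad()
    for i, batch in enumerate(dataloader):
        loss = model(**batch)            # assumed to return the training loss
        (loss / accum_steps).backward()  # average gradients over the macro-batch
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            updates += 1                 # "small iterations": few updates total
            if updates >= max_updates:
                break
    return updates
```

The design intent is that large, diverse batches give low-variance gradient estimates per update, so few updates are needed, which in turn limits how far the fine-tuned weights drift from the pretrained ones.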
👥 Authors
Jingkai Zhou, Independent Researcher (Computer Vision)
Yifan Wu, Southern University of Science and Technology
Shikai Li, DAMO Academy, Alibaba Group, Hupan Lab
Min Wei, Alibaba DAMO Academy
Chao Fan, Shenzhen University
Weihua Chen, Alibaba DAMO Academy (previously NLPR, CASIA; Computer Vision)
Wei Jiang, Zhejiang University
Fan Wang, DAMO Academy, Alibaba Group, Hupan Lab