RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild

📅 2025-04-21
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Controllable character animation faces generalization bottlenecks in open-world scenarios, particularly under rare poses, stylized characters, character-object interactions, complex lighting, and dynamic backgrounds. To address this, we propose RealisDance-DiT, a lightweight controllable animation framework built upon the Wan-2.1 video diffusion Transformer (DiT). Our approach eliminates the redundant Reference Net and introduces only minimal architectural modifications to the foundation model. We further introduce a low-noise warmup mechanism and a "large batches and small iterations" fine-tuning strategy that balance convergence efficiency with preservation of the base model's priors. Evaluated on a newly curated real-world test set and established benchmarks, including the TikTok and UBC Fashion video datasets, our method significantly outperforms prior art, establishing a new strong baseline for controllable character animation in the wild.
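The page text does not spell out what the "minimal architectural modifications" are. One common minimal-modification scheme for conditioning a pretrained DiT, sketched below under that assumption, is to project the pose condition into the latent space with a zero-initialized layer and add it to the noisy input latents, so the wrapped model initially behaves exactly like the pretrained one. All names here (PoseConditionedDiT, pose_proj, the additive injection itself) are hypothetical illustrations, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PoseConditionedDiT(nn.Module):
    """Wraps a pretrained video DiT so pose guidance enters through the
    input latents instead of a separate Reference Net branch."""

    def __init__(self, dit: nn.Module, latent_channels: int, pose_channels: int):
        super().__init__()
        self.dit = dit
        # Zero-initialized 1x1x1 projection: at initialization the wrapper
        # adds nothing, preserving the foundation model's priors.
        self.pose_proj = nn.Conv3d(pose_channels, latent_channels, kernel_size=1)
        nn.init.zeros_(self.pose_proj.weight)
        nn.init.zeros_(self.pose_proj.bias)

    def forward(self, noisy_latents, pose_latents, timesteps):
        # Additive condition injection; no bypass network is needed.
        x = noisy_latents + self.pose_proj(pose_latents)
        return self.dit(x, timesteps)
```

Zero initialization (as popularized by ControlNet) is what makes such a modification "minimal" in practice: fine-tuning starts from the unmodified foundation model and only gradually learns to use the pose signal.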

📝 Abstract
Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but often struggles to generalize to open-world scenarios. In this paper, we propose a new perspective: as long as the foundation model is powerful enough, straightforward model modifications with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as the TikTok dataset and the UBC fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.
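The abstract names the low-noise warmup strategy without describing its mechanics. A plausible reading, sketched below, is to restrict sampled diffusion timesteps to the low-noise end of the schedule early in fine-tuning and widen the range linearly over training; the function name, the linear schedule, and the warmup_steps/min_frac defaults are assumptions for illustration, not details from the paper.

```python
import torch

def sample_warmup_timesteps(batch_size: int, step: int,
                            num_train_timesteps: int = 1000,
                            warmup_steps: int = 2000,
                            min_frac: float = 0.05) -> torch.Tensor:
    """Sample diffusion timesteps with a low-noise warmup.

    Uses the DDPM convention that t = 0 is clean data and larger t means
    more noise. The admissible range [0, t_max) grows linearly with the
    fine-tuning step until the full schedule is covered.
    """
    frac = min(1.0, max(step / warmup_steps, min_frac))
    t_max = max(1, int(frac * num_train_timesteps))
    return torch.randint(0, t_max, (batch_size,))
```

A training loop would call sample_warmup_timesteps(batch_size, step) in place of a uniform sampler; the intuition is that low-noise timesteps perturb the pretrained denoiser least, easing the model into the new conditioning before exposing it to the full noise range.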
Problem

Research questions and friction points this paper is trying to address.

Addressing controllable character animation challenges in diverse real-world scenarios
Improving generalization for rare poses and complex scenes with minimal model changes
Enhancing fine-tuning strategies to preserve foundation model priors effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimal modifications to foundation model architecture
Low-noise warmup and "large batches and small iterations" fine-tuning strategies (see the sketch after this list)
Built upon Wan-2.1 video foundation model
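The "large batches and small iterations" strategy is stated but not implemented anywhere in this page's text. Gradient accumulation is one standard way to realize a large effective batch on fixed hardware while capping the number of optimizer updates; the sketch below is illustrative only, and the function name, the accum_steps and max_updates values, and the assumption that model(**batch) returns a scalar loss are all hypothetical.

```python
import torch

def finetune_large_batch(model, dataloader, optimizer,
                         accum_steps: int = 16, max_updates: int = 1000) -> int:
    """Emulate a large effective batch via gradient accumulation while
    limiting the total number of optimizer updates, so fine-tuning
    converges without drifting far from the foundation model's priors."""
    model.train()
    updates = 0
    optimizer.zero_grad()
    for i, batch in enumerate(dataloader):
        loss = model(**batch)            # assumed to return the training loss
        (loss / accum_steps).backward()  # average gradients over the macro-batch
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            updates += 1                 # "small iterations": few updates total
            if updates >= max_updates:
                break
    return updates
```

The design intent is that large, diverse batches give low-variance gradient estimates per update, so few updates are needed, which in turn limits how far the fine-tuned weights drift from the pretrained ones.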
👥 Authors
Jingkai Zhou, Independent Researcher (Computer Vision)
Yifan Wu, Southern University of Science and Technology
Shikai Li, DAMO Academy, Alibaba Group, Hupan Lab
Min Wei, Alibaba DAMO Academy
Chao Fan, Shenzhen University
Weihua Chen, Alibaba DAMO Academy (previously NLPR, CASIA; Computer Vision)
Wei Jiang, Zhejiang University
Fan Wang, DAMO Academy, Alibaba Group, Hupan Lab