🤖 AI Summary
This work addresses the challenge of generalizing diffusion models to animate characters with arbitrary skeletal topologies (e.g., non-humanoid). We propose the first diffusion-based framework for universal rigging and animation across diverse skeletal structures, built on three components: (1) a topology-agnostic skeleton encoding; (2) an online procedural synthetic-data pipeline enabling few-shot, character-specific rig inference from only 3–5 skeleton-annotated images; and (3) a high-fidelity, 2D keypoint-driven rendering method. Contributions include the first 2D keypoint animation benchmark covering both humanoid and non-humanoid characters; state-of-the-art performance on both realistic and cartoon-style characters, significantly outperforming existing methods; and empirical validation of strong cross-topology generalization and robustness. The framework bridges a critical gap in diffusion-based character animation by decoupling motion generation from rigid skeletal assumptions, enabling flexible, data-efficient adaptation to novel anatomies.
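To make the "topology-agnostic skeleton encoding" concrete, here is a minimal illustrative sketch of one way such an encoding could work: joints as 2D keypoints plus a parent index per joint, rasterized into a conditioning map. The class names, fields, and rasterization scheme are assumptions for illustration only, not the paper's actual representation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Skeleton:
    # Hypothetical encoding: positions plus connectivity, with no fixed
    # joint count or humanoid template assumed.
    joints: np.ndarray   # (J, 2) 2D keypoint positions in [0, 1]
    parents: np.ndarray  # (J,) parent index per joint; -1 marks the root

    def bones(self):
        """Return (child, parent) index pairs for every non-root joint."""
        return [(j, p) for j, p in enumerate(self.parents) if p >= 0]

def rasterize(skel: Skeleton, size: int = 64) -> np.ndarray:
    """Draw bones as line segments into a single-channel map.

    A model conditioned on such maps sees only points and connectivity,
    so any tree-structured skeleton (biped, quadruped, tail, wings, ...)
    fits the same input format.
    """
    canvas = np.zeros((size, size), dtype=np.float32)
    for child, parent in skel.bones():
        a, b = skel.joints[child], skel.joints[parent]
        # Densely sample points along the bone and stamp them onto the canvas.
        for t in np.linspace(0.0, 1.0, 2 * size):
            x, y = (1 - t) * a + t * b
            canvas[int(y * (size - 1)), int(x * (size - 1))] = 1.0
    return canvas

# Example: a 4-joint "tail" chain, a topology no humanoid template covers.
tail = Skeleton(
    joints=np.array([[0.2, 0.5], [0.4, 0.4], [0.6, 0.5], [0.8, 0.6]]),
    parents=np.array([-1, 0, 1, 2]),
)
heatmap = rasterize(tail)
```

Because the encoding carries connectivity explicitly rather than baking in a fixed joint layout, the same conditioning format covers both the humanoid and non-humanoid characters the benchmark targets.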
📝 Abstract
Recent diffusion-based methods have achieved impressive results on animating images of human subjects. However, most of that success has been built on human-specific body pose representations and extensive training with labeled real videos. In this work, we extend the ability of such models to animate images of characters with more diverse skeletal topologies. Given a small number (3–5) of example frames showing the character in different poses with corresponding skeletal information, our model quickly infers a rig for that character that can generate images corresponding to new skeleton poses. We propose a procedural data generation pipeline that efficiently samples training data with diverse topologies on the fly. We use it, along with a novel skeleton representation, to train our model on articulated shapes spanning a large space of textures and topologies. During fine-tuning, our model then rapidly adapts to unseen target characters and generalizes well to rendering new poses, for both realistic and more stylized cartoon appearances. To better evaluate performance on this novel and challenging task, we create the first 2D video dataset that contains both humanoid and non-humanoid subjects with per-frame keypoint annotations. With extensive experiments, we demonstrate the superior quality of our results. Project page: https://traindragondiffusion.github.io/