AI Summary
Text-to-3D motion generation suffers from the scarcity of large-scale, fine-grained motion datasets and from poor generalization across species and heterogeneous skeletal topologies. To address these challenges, this paper introduces the first universal text-driven motion generation framework for large-vocabulary objects. The approach comprises three key contributions: (1) a text-annotated version of the Truebones Zoo dataset, a large-scale animal motion dataset enriched with fine-grained semantic descriptions; (2) a rig augmentation strategy coupled with a skeleton-aware diffusion model that dynamically adapts to arbitrary skeletal topologies; and (3) high-fidelity motion synthesis for both seen and unseen objects in multi-category, multi-skeleton scenarios. Extensive experiments demonstrate state-of-the-art performance on multiple benchmarks, validating robust cross-species generalization and topology-agnostic motion generation.
Abstract
Motion synthesis for diverse object categories holds great potential for 3D content creation but remains underexplored due to two key challenges: (1) the lack of comprehensive motion datasets that include a wide range of high-quality motions and annotations, and (2) the absence of methods capable of handling heterogeneous skeletal templates from diverse objects. To address these challenges, we contribute the following: First, we augment the Truebones Zoo dataset, a high-quality animal motion dataset covering over 70 species, by annotating it with detailed text descriptions, making it suitable for text-based motion synthesis. Second, we introduce rig augmentation techniques that generate diverse motion data while preserving consistent dynamics, enabling models to adapt to various skeletal configurations. Finally, we redesign existing motion diffusion models to dynamically adapt to arbitrary skeletal templates, enabling motion synthesis for a diverse range of objects with varying structures. Experiments show that our method learns to generate high-fidelity motions from textual descriptions for diverse and even unseen objects, setting a strong foundation for motion synthesis across diverse object categories and skeletal templates. Qualitative results are available at: t2m4lvo.github.io
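To make the "dynamically adapt to arbitrary skeletal templates" idea concrete, here is a minimal, hypothetical sketch (not the authors' actual architecture) of one common way a transformer-based diffusion model can be conditioned on a rig: treating each joint as a token and masking joint-to-joint attention by graph distance along the skeletal tree, so the same model applies to any topology described by a parent array. The function name, the `max_hops` parameter, and the parent-array encoding are all illustrative assumptions.

```python
# Hedged sketch: a topology-aware attention mask for a motion transformer.
# Conditioning on an arbitrary skeleton via masked attention is an assumption
# for illustration, not the paper's confirmed design.
from collections import deque

def skeleton_attention_mask(parents, max_hops=2):
    """parents[i] is the parent joint index of joint i (-1 for the root).
    Returns an n x n boolean mask: mask[j][k] is True when joints j and k
    are within `max_hops` edges of each other on the skeletal tree."""
    n = len(parents)
    # Build an undirected adjacency list from the parent array.
    adj = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[child].append(parent)
            adj[parent].append(child)
    mask = [[False] * n for _ in range(n)]
    for start in range(n):
        # BFS outward from each joint, up to max_hops edges.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v in dist:
            mask[start][v] = True
    return mask

# Example: a 5-joint chain, root -> 1 -> 2 -> 3 -> 4.
mask = skeleton_attention_mask([-1, 0, 1, 2, 3], max_hops=2)
```

Because the mask is derived purely from the parent array, the same model weights can, in principle, process a biped, a quadruped, or a winged rig without architectural changes, which is the kind of topology-agnostic behavior the abstract describes.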