How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects

πŸ“… 2025-03-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Text-to-3D motion generation suffers from the scarcity of large-scale, fine-grained motion datasets and from poor generalization across species and heterogeneous skeletal topologies. To address these challenges, this paper introduces the first universal text-driven motion generation framework for large-vocabulary objects. The approach comprises three key contributions: (1) a text-annotated Truebones Zoo, augmenting the existing high-quality animal motion dataset (70+ species) with fine-grained semantic descriptions; (2) a rig augmentation strategy coupled with a dynamic skeleton-aware diffusion model, enabling adaptive modeling of arbitrary skeletal topologies; and (3) high-fidelity motion synthesis for both seen and unseen objects in multi-category, multi-skeleton scenarios. Extensive experiments demonstrate state-of-the-art performance on multiple benchmarks, validating robust cross-species generalization and topology-agnostic motion generation.

πŸ“ Abstract
Motion synthesis for diverse object categories holds great potential for 3D content creation but remains underexplored due to two key challenges: (1) the lack of comprehensive motion datasets that include a wide range of high-quality motions and annotations, and (2) the absence of methods capable of handling heterogeneous skeletal templates from diverse objects. To address these challenges, we contribute the following: First, we augment the Truebones Zoo dataset, a high-quality animal motion dataset covering over 70 species, by annotating it with detailed text descriptions, making it suitable for text-based motion synthesis. Second, we introduce rig augmentation techniques that generate diverse motion data while preserving consistent dynamics, enabling models to adapt to various skeletal configurations. Finally, we redesign existing motion diffusion models to dynamically adapt to arbitrary skeletal templates, enabling motion synthesis for a diverse range of objects with varying structures. Experiments show that our method learns to generate high-fidelity motions from textual descriptions for diverse and even unseen objects, setting a strong foundation for motion synthesis across diverse object categories and skeletal templates. Qualitative results are available at t2m4lvo.github.io
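The abstract's rig augmentation idea, generating variants of a skeleton while preserving the underlying dynamics, can be illustrated with a minimal sketch. The paper does not publish this code; the function below is a hypothetical example of one simple augmentation, removing an intermediate joint from a kinematic chain and folding its local offset into its children so rest-pose joint positions are unchanged:

```python
import numpy as np

def remove_joint(parents, offsets, j):
    """Remove intermediate joint j from a skeleton, reparenting its
    children and folding its local offset into theirs so global
    rest-pose joint positions are preserved.

    parents: list of parent indices (-1 for the root)
    offsets: (J, 3) array of local offsets from each joint to its parent
    """
    p = parents[j]
    assert p != -1, "cannot remove the root joint"
    new_parents, new_offsets = [], []
    for i, (par, off) in enumerate(zip(parents, offsets)):
        if i == j:
            continue  # drop the removed joint itself
        if par == j:
            # child of the removed joint: reparent and merge offsets
            par, off = p, off + offsets[j]
        if par > j:
            par -= 1  # indices above j shift down by one
        new_parents.append(par)
        new_offsets.append(off)
    return new_parents, np.array(new_offsets)
```

For a three-joint chain with unit offsets, removing the middle joint yields a two-joint chain whose end joint keeps its original global position.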
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive motion datasets with diverse annotations
Absence of methods for handling heterogeneous skeletal templates
Challenges in text-to-motion synthesis for diverse object categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmented Truebones Zoo with text annotations
Introduced rig augmentation for diverse motions
Redesigned motion diffusion for skeletal adaptability
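Handling arbitrary skeletal templates in one diffusion model implies batching motions with different joint counts. The paper's actual conditioning is not reproduced here; the sketch below shows one standard way such a denoiser could be fed, padding heterogeneous skeletons to a common joint count and building a key-padding mask so attention ignores padded joints:

```python
import numpy as np

def pad_and_mask(motions):
    """Pad a batch of motions with heterogeneous joint counts to a
    common size and build a key-padding mask (True = padded joint),
    as commonly done for transformer-style denoisers.

    motions: list of (J_i, T, C) arrays, same T and C across the batch
    """
    J_max = max(m.shape[0] for m in motions)
    T, C = motions[0].shape[1], motions[0].shape[2]
    padded = np.zeros((len(motions), J_max, T, C), dtype=np.float32)
    mask = np.ones((len(motions), J_max), dtype=bool)
    for b, m in enumerate(motions):
        padded[b, : m.shape[0]] = m   # copy real joints
        mask[b, : m.shape[0]] = False  # mark them as valid
    return padded, mask
```

A mask like this would typically be passed to the attention layers (e.g. as a key-padding mask) so padded joints contribute nothing to the denoising step.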