🤖 AI Summary
This work addresses a key bottleneck in text-driven human motion generation—its reliance on predefined action labels and inability to handle open-domain natural language descriptions—by formally introducing and systematically studying the novel task of *arbitrary text-to-motion generation*. To support this task, we construct HUMANML3D++, the first large-scale dataset for this setting, built by extending the texts of HUMANML3D. We propose an *action instruction extraction framework* that explicitly decouples semantic parsing from motion synthesis, and design a *diffusion-based controllable generation architecture*. Furthermore, we establish a multi-solution consistency evaluation benchmark, incorporating new metrics including Multi-Motion FID and Diversity. Experiments demonstrate that our approach significantly outperforms baselines in semantic alignment, motion diversity, and controllability. It enables real-world applications such as virtual human interaction, robotic behavior planning, and cinematic motion synthesis, advancing text-to-motion generation toward practical deployment.
📝 Abstract
Text to Motion aims to generate human motions from texts. Existing settings rely on limited Action Texts that include explicit action labels, which limits flexibility and practicality in scenarios that are difficult to describe directly. This paper extends limited Action Texts to arbitrary ones. Scene texts without explicit action labels can enhance the practicality of models in complex and diverse industries such as virtual human interaction, robot behavior generation, and film production, while also supporting the exploration of potential implicit behavior patterns. However, a newly introduced Scene Text may yield multiple reasonable outputs, posing significant challenges to existing data, frameworks, and evaluation. To address this practical issue, we first create a new dataset, HUMANML3D++, by extending the texts of the largest existing dataset, HUMANML3D. Secondly, we propose a simple yet effective framework that extracts action instructions from arbitrary texts and subsequently generates motions. Furthermore, we benchmark this new setting with multi-solution metrics to address the inadequacies of existing single-solution metrics. Extensive experiments indicate that Text to Motion in this realistic setting is challenging, fostering new research in this practical direction.