🤖 AI Summary
This work addresses a key bottleneck in text-driven human motion generation—its reliance on predefined action labels and inability to handle open-domain natural language descriptions—by formally introducing and systematically studying the novel task of *arbitrary text-to-motion generation*. To support this task, we construct HUMANML3D++, the first large-scale dataset for this setting, built by extending the texts of HUMANML3D. We propose an *action instruction extraction framework* that explicitly decouples semantic parsing from motion synthesis, and design a *diffusion-based controllable generation architecture*. Furthermore, we establish a multi-solution consistency evaluation benchmark, incorporating new metrics including Multi-Motion FID and Diversity. Experiments demonstrate that our approach significantly outperforms baselines in semantic alignment, motion diversity, and controllability. It enables real-world applications such as virtual human interaction, robotic behavior planning, and cinematic motion synthesis, advancing text-to-motion generation toward practical deployment.
📝 Abstract
Text to Motion aims to generate human motions from texts. Existing settings rely on limited Action Texts that include explicit action labels, which limits flexibility and practicality in scenarios that are difficult to describe directly. This paper extends limited Action Texts to arbitrary ones. Scene texts without explicit action labels can enhance the practicality of models in complex and diverse industries such as virtual human interaction, robot behavior generation, and film production, while also supporting the exploration of potential implicit behavior patterns. However, a newly introduced Scene Text may yield multiple reasonable outputs, posing significant challenges to existing data, frameworks, and evaluation. To address this practical issue, we first create a new dataset, HUMANML3D++, by extending the texts of the largest existing dataset, HUMANML3D. Secondly, we propose a simple yet effective framework that extracts action instructions from arbitrary texts and subsequently generates motions. Furthermore, we benchmark this new setting with multi-solution metrics to address the inadequacies of existing single-solution metrics. Extensive experiments indicate that Text to Motion in this realistic setting is challenging, fostering new research in this practical direction.