AI Summary
This paper addresses the cross-modal generation problem of text-to-human-motion synthesis, aiming to enable flexible, fine-grained linguistic control over animated avatars. Methodologically, it presents a systematic survey tracing the field's evolution, from action-label-conditioned prediction to end-to-end language-driven paradigms, and introduces, for the first time, a two-dimensional taxonomy: "model architecture (VAE/diffusion/hybrid) × motion representation (discrete/continuous)." Leveraging benchmarks including KIT-ML and HumanML3D, the work unifies variational autoencoders, diffusion modeling, motion tokenization, and multimodal alignment techniques into a coherent evaluation framework. Key contributions include: (i) clarifying the core challenges of semantic alignment, temporal coherence, and motion realism; (ii) establishing empirical performance boundaries; and (iii) advancing methodological standardization and reproducibility. The study provides foundational technical support for applications in VR, gaming, human-computer interaction, and embodied AI.
Abstract
Text-driven motion generation offers a powerful and intuitive way to create human movements directly from natural language. By removing the need for predefined motion inputs, it provides a flexible and accessible approach to controlling animated characters, which makes it especially useful in areas such as virtual reality, gaming, human-computer interaction, and robotics. In this review, we first revisit the traditional perspective on motion synthesis, in which models predicted future poses from observed initial sequences, often conditioned on action labels. We then provide a comprehensive and structured survey of modern text-to-motion generation approaches, categorizing them from two complementary perspectives: (i) architecture, dividing methods into VAE-based, diffusion-based, and hybrid models; and (ii) motion representation, distinguishing between discrete and continuous motion generation strategies. In addition, we cover the most widely used datasets, evaluation methods, and recent benchmarks that have shaped progress in this area. With this survey, we aim to capture where the field currently stands, draw attention to its key challenges and limitations, and highlight promising directions for future exploration. We hope this work offers a valuable starting point for researchers and practitioners working to push the boundaries of language-driven human motion synthesis.