LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning

📅 2024-10-09

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

career value

176K/year

🤖 AI Summary

Existing text-to-motion generation methods rely on CLIP’s image-text pretrained embeddings, which poorly capture dynamic action semantics, resulting in weak cross-modal alignment and limited generation quality. To address this, we propose LaMP, a novel language-motion joint pretraining framework featuring a dual-stream Transformer architecture. LaMP integrates motion-aware text embeddings, autoregressive masked prediction (to prevent rank collapse), cross-modal attention mechanisms, and motion-feature-driven semantic alignment learning. Furthermore, we introduce LaMP-BERTScore—a dedicated evaluation metric for text-motion generation. Extensive experiments demonstrate state-of-the-art performance across multiple benchmarks: 23.6% reduction in text-to-motion FID, 18.4% improvement in motion↔text retrieval R@1, and 9.2% gain in motion description BLEU-4. Our code is publicly available.

Technology Category

Application Category

📝 Abstract

Language plays a vital role in the realm of human motion. Existing methods have largely depended on CLIP text embeddings for motion generation, yet they fall short in effectively aligning language and motion due to CLIP's pretraining on static image-text pairs. This work introduces LaMP, a novel Language-Motion Pretraining model, which transitions from a language-vision to a more suitable language-motion latent space. It addresses key limitations by generating motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences. With LaMP, we advance three key tasks: text-to-motion generation, motion-text retrieval, and motion captioning through aligned language-motion representation learning. For generation, we utilize LaMP to provide the text condition instead of CLIP, and an autoregressive masked prediction is designed to achieve mask modeling without rank collapse in transformers. For retrieval, motion features from LaMP's motion transformer interact with query tokens to retrieve text features from the text transformer, and vice versa. For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model. In addition, we introduce the LaMP-BertScore metric to assess the alignment of generated motions with textual descriptions. Extensive experimental results on multiple datasets demonstrate substantial improvements over previous methods across all three tasks. The code of our method will be made public.

Problem

Research questions and friction points this paper is trying to address.

Enhances alignment between language and motion for generation, retrieval, and captioning.

Introduces LaMP to replace CLIP for better motion-informative text embeddings.

Advances text-to-motion generation, motion-text retrieval, and motion captioning tasks.

Innovation

Methods, ideas, or system contributions that make the work stand out.

LaMP model aligns language-motion latent space

Autoregressive masked prediction prevents rank collapse

LaMP-BertScore evaluates motion-text alignment

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs