CASIM: Composite Aware Semantic Injection for Text to Motion Generation

📅 2025-02-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-motion generation methods rely on fixed-length text embeddings (e.g., CLIP), which limits their ability to model the compositional structure of human motion and results in coarse-grained semantic alignment and poor controllability. To address this, we propose a composite-aware semantic injection framework: (1) a composite-aware semantic encoder that explicitly models hierarchical action semantics; (2) a dynamic token-level text-motion aligner enabling fine-grained cross-modal matching; and (3) a unified injection architecture compatible with both autoregressive and diffusion-based motion generators. The method is plug-and-play and model-agnostic. Experiments on HumanML3D and KIT demonstrate significant improvements in motion quality (FID ↓), text-motion alignment accuracy (R-Precision ↑), and cross-modal retrieval performance. Moreover, it enhances generation controllability and generalization to unseen actions and linguistic variations.


๐Ÿ“ Abstract
Recent advances in generative modeling and tokenization have driven significant progress in text-to-motion generation, leading to enhanced quality and realism in generated motions. However, effectively leveraging textual information for conditional motion generation remains an open challenge. We observe that current approaches, primarily relying on fixed-length text embeddings (e.g., CLIP) for global semantic injection, struggle to capture the composite nature of human motion, resulting in suboptimal motion quality and controllability. To address this limitation, we propose the Composite Aware Semantic Injection Mechanism (CASIM), comprising a composite-aware semantic encoder and a text-motion aligner that learns the dynamic correspondence between text and motion tokens. Notably, CASIM is model and representation-agnostic, readily integrating with both autoregressive and diffusion-based methods. Experiments on HumanML3D and KIT benchmarks demonstrate that CASIM consistently improves motion quality, text-motion alignment, and retrieval scores across state-of-the-art methods. Qualitative analyses further highlight the superiority of our composite-aware approach over fixed-length semantic injection, enabling precise motion control from text prompts and stronger generalization to unseen text inputs.
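The core contrast in the abstract, between fixed-length global injection and token-level composite-aware injection, can be sketched minimally. The sketch below is illustrative only: function names, shapes, and the simple dot-product attention are assumptions, not CASIM's actual architecture.

```python
# Minimal sketch (not the paper's implementation) of two conditioning styles:
# fixed-length pooled injection vs. token-level cross-attention injection.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fixed_length_injection(text_tokens, motion_tokens):
    """Global injection: pool all text tokens into one vector and add it
    uniformly to every motion token (the baseline CASIM improves upon)."""
    pooled = text_tokens.mean(axis=0)                      # (d,)
    return motion_tokens + pooled                          # same vector everywhere

def token_level_injection(text_tokens, motion_tokens):
    """Composite-aware injection: each motion token attends over all text
    tokens, so different motion segments can bind to different phrases."""
    d_k = text_tokens.shape[-1]
    scores = motion_tokens @ text_tokens.T / np.sqrt(d_k)  # (T_m, T_t)
    attn = softmax(scores, axis=-1)                        # per-token weights
    return motion_tokens + attn @ text_tokens              # weighted text injected

T_t, T_m, d = 6, 10, 16          # text tokens, motion tokens, feature dim
text = rng.standard_normal((T_t, d))
motion = rng.standard_normal((T_m, d))

out_fixed = fixed_length_injection(text, motion)
out_token = token_level_injection(text, motion)
print(out_fixed.shape, out_token.shape)  # both (10, 16)
```

The key difference: in the fixed-length path every motion token receives the identical pooled text vector, while in the token-level path the attention weights let each motion token draw on different parts of the prompt, which is what enables fine-grained alignment.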
Problem

Research questions and friction points this paper is trying to address.

Improve text-to-motion generation quality
Enhance text-motion alignment control
Generalize to unseen text inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Composite-aware semantic encoder
Dynamic text-motion aligner
Model-agnostic integration
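The "model-agnostic integration" bullet can be pictured as a single injection hook shared by different generator families. The toy sketch below assumes hypothetical class and method names (`AutoregressiveGenerator`, `DiffusionGenerator`, `additive_injector`); it illustrates the plug-in idea, not CASIM's actual API.

```python
# Hypothetical sketch: one injector callable, reused unchanged inside an
# autoregressive step and a diffusion denoising step. All names are
# illustrative assumptions, not the paper's implementation.
from typing import Callable, List

Injector = Callable[[List[float], List[float]], List[float]]

def additive_injector(motion_feats: List[float], text_feats: List[float]) -> List[float]:
    """Toy injector: tile text features to motion length and add them."""
    n = len(motion_feats)
    tiled = (text_feats * (n // len(text_feats) + 1))[:n]
    return [m + t for m, t in zip(motion_feats, tiled)]

class AutoregressiveGenerator:
    def __init__(self, inject: Injector):
        self.inject = inject
    def step(self, history: List[float], text_feats: List[float]) -> float:
        # Inject text semantics before predicting the next motion token.
        conditioned = self.inject(history, text_feats)
        return sum(conditioned) / len(conditioned)   # toy "next token"

class DiffusionGenerator:
    def __init__(self, inject: Injector):
        self.inject = inject
    def denoise(self, noisy: List[float], text_feats: List[float]) -> List[float]:
        # The same injector conditions each denoising step.
        conditioned = self.inject(noisy, text_feats)
        return [0.5 * x for x in conditioned]        # toy denoiser

text = [0.1, 0.2]
ar = AutoregressiveGenerator(additive_injector)
dm = DiffusionGenerator(additive_injector)
out_ar = ar.step([1.0, 2.0, 3.0], text)
out_dm = dm.denoise([1.0, 2.0, 3.0], text)
print(out_ar, out_dm)
```

Because both generators call the injector through the same interface, swapping the conditioning mechanism requires no change to either backbone, which is the sense in which such an injection module is "plug-and-play".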