Exploring Motion-Language Alignment for Text-driven Motion Generation

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of precisely aligning motion dynamics with textual semantics in text-driven human motion generation. The authors propose MLA-Gen, a framework that integrates global motion priors with fine-grained local text conditioning to model both common motion patterns and detailed text-motion correspondence. They identify a previously overlooked attention sink phenomenon, in which attention concentrates disproportionately on the start text token, and introduce SinkRatio, a metric for measuring this attention concentration. Building on this analysis, they develop alignment-aware masking and control strategies to regulate attention during generation. Experiments show that MLA-Gen consistently improves both motion quality and motion-language alignment over strong baselines.
📝 Abstract
Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns, while establishing detailed alignment between texts and motions. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and leading to degraded semantic grounding. To analyze this issue, we introduce SinkRatio, a metric for measuring attention concentration, and develop alignment-aware masking and control strategies to regulate attention during generation. Extensive experiments demonstrate that our approach consistently improves both motion quality and motion-language alignment over strong baselines. Code will be released upon acceptance.
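The abstract describes SinkRatio only as a metric for measuring attention concentration and does not give its formula. As a rough illustration of the idea, the sketch below assumes the ratio is the average share of attention mass that motion queries place on the start text token. The function names `softmax` and `sink_ratio`, the `sink_index` parameter, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sink_ratio(attn: Sequence[Sequence[float]], sink_index: int = 0) -> float:
    """Average fraction of attention mass placed on the token at sink_index.

    attn holds one row per motion query and one column per text token.
    Each row is assumed to be non-negative and to sum to 1.
    This definition is an assumption made for illustration only.
    """
    if not attn:
        raise ValueError("attn must contain at least one query row")
    return sum(row[sink_index] for row in attn) / len(attn)


# Toy cross-attention logits: 3 motion queries over 4 text tokens.
# The first two queries strongly favour token 0 (the start token).
logits = [
    [4.0, 0.5, 0.2, 0.1],
    [3.5, 0.3, 1.0, 0.2],
    [0.0, 0.0, 0.0, 0.0],
]
attn = [softmax(row) for row in logits]
ratio = sink_ratio(attn)
uniform = 1.0 / len(logits[0])
print(f"sink ratio = {ratio:.3f} (uniform attention would give {uniform:.3f})")
```

Under this assumed definition, a value well above `1 / num_tokens` would indicate that the start token absorbs more attention than an even spread would give it, which is the kind of concentration the abstract attributes to the attention sink.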
Problem

Research questions and friction points this paper is trying to address.

motion-language alignment
text-driven motion generation
semantic grounding
attention sink
human motion generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

motion-language alignment
attention sink
text-driven motion generation
SinkRatio
alignment-aware masking
Ruxi Gu
Department of Automation, University of Science and Technology of China, Hefei, China
Zilei Wang
University of Science and Technology of China
Computer Vision, Deep Learning, Pattern Recognition
Wei Wang
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China