🤖 AI Summary
This work addresses the challenge of achieving precise alignment between motion dynamics and semantic content in text-driven human motion generation. To this end, the authors propose the MLA-Gen framework, which integrates global motion priors with fine-grained local textual conditions to jointly model general motion patterns and detailed semantic correspondence. The study is the first to identify an “attention collapse” phenomenon during generation and introduces SinkRatio—a novel metric to quantitatively assess alignment quality. Building upon this insight, the authors design alignment-aware masking and attention modulation strategies to refine the motion distribution. Extensive experiments demonstrate that MLA-Gen significantly outperforms strong existing baselines across multiple benchmarks, achieving state-of-the-art performance in both motion quality and text-motion semantic alignment.
📝 Abstract
Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns, while establishing detailed alignment between texts and motions. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and leading to degraded semantic grounding. To analyze this issue, we introduce SinkRatio, a metric for measuring attention concentration, and develop alignment-aware masking and control strategies to regulate attention during generation. Extensive experiments demonstrate that our approach consistently improves both motion quality and motion-language alignment over strong baselines. Code will be released upon acceptance.