🤖 AI Summary
Existing text-to-motion generation methods exhibit coarse-grained semantic alignment and under-exploit critical temporal cues, especially under rare language prompts, leading to motion redundancy and semantic inconsistency. Method: We propose a fine-grained, clip-level temporal text-motion alignment framework. Its core innovation, the Temporal Clip Banzhaf Interaction mechanism, is the first to enable interpretable, granular semantic matching between natural-language descriptions and motion segments. We integrate this mechanism with a motion diffusion model augmented by a motion prompting module and a retrieval-enhanced fine-grained alignment strategy. Contribution/Results: Our method achieves state-of-the-art performance on text-to-motion retrieval and generation benchmarks, with particularly large gains under rare text prompts. Ablations confirm that clip-level cross-modal interaction is essential for enforcing semantic consistency across modalities.
📝 Abstract
We introduce MOST, a novel motion diffusion model built on temporal clip Banzhaf interaction, designed to address the persistent challenge of generating human motion from rare language prompts. Previous approaches suffer from coarse-grained matching and overlook important semantic cues because of motion redundancy; our key insight is to exploit fine-grained clip relationships to mitigate both issues. In the retrieval stage, MOST presents the first formulation of its kind, temporal clip Banzhaf interaction, which precisely quantifies text-motion coherence at the clip level. This enables direct, fine-grained matching between text and motion clips and eliminates the prevalent redundancy. In the generation stage, a motion prompt module exploits the retrieved motion clips to produce semantically consistent movements. Extensive quantitative and qualitative evaluations show that MOST achieves state-of-the-art text-to-motion retrieval and generation performance, with especially strong results on rare prompts.
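The Banzhaf interaction that MOST builds on originates in cooperative game theory: for a pair of players (here, clips), it averages the pairwise synergy v(S∪{i,j}) − v(S∪{i}) − v(S∪{j}) + v(S) over all coalitions S of the remaining players. The sketch below is a minimal, hedged illustration of that generic index on a toy value function, not the paper's actual clip-level formulation; the payoff table and synergy term are invented for the example.

```python
from itertools import combinations

def banzhaf_interaction(value, players, i, j):
    """Banzhaf interaction index for the pair (i, j).

    `value` maps a frozenset of players (e.g. motion clips) to a scalar
    payoff, such as a text-motion similarity over the selected clips.
    Averages the pairwise synergy over all coalitions S of the others.
    """
    others = [p for p in players if p not in (i, j)]
    total, count = 0.0, 0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            S = frozenset(S)
            total += (value(S | {i, j}) - value(S | {i})
                      - value(S | {j}) + value(S))
            count += 1
    return total / count

# Toy value function (illustrative only): additive per-clip payoffs
# plus an extra synergy when clips 0 and 1 appear together.
base = {0: 1.0, 1: 2.0, 2: 0.5}
def v(S):
    payoff = sum(base[p] for p in S)
    if 0 in S and 1 in S:  # clips 0 and 1 reinforce each other
        payoff += 0.75
    return payoff

print(banzhaf_interaction(v, [0, 1, 2], 0, 1))  # recovers the synergy: 0.75
print(banzhaf_interaction(v, [0, 1, 2], 0, 2))  # no interaction: 0.0
```

On this toy game the index exactly recovers the injected synergy between clips 0 and 1 and reports zero for non-interacting pairs, which is the property that makes it attractive as an interpretable, fine-grained coherence score.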