🤖 AI Summary
Existing text-to-motion generation methods exhibit coarse-grained semantic alignment and under-exploit critical temporal cues, especially under rare language prompts, leading to motion redundancy and semantic inconsistency. Method: We propose a fine-grained, clip-level temporal text-motion alignment framework. Its core innovation, the Temporal Clip Banzhaf Interaction mechanism, is the first to enable interpretable, granular semantic matching between natural-language descriptions and motion segments. We integrate this mechanism with a motion diffusion model augmented by a motion prompting module and a retrieval-enhanced fine-grained alignment strategy. Contribution/Results: Our method achieves state-of-the-art performance on text-to-motion retrieval and generation benchmarks, with particularly large gains under rare text prompts. Ablations confirm that clip-level cross-modal interaction is essential for enforcing semantic consistency across modalities.
📝 Abstract
We introduce MOST, a novel motion diffusion model built on temporal clip Banzhaf interaction, designed to address the persistent challenge of generating human motion from rare language prompts. Previous approaches suffer from coarse-grained matching and overlook important semantic cues because of motion redundancy; our key insight is to exploit fine-grained clip relationships to mitigate both issues. In the retrieval stage, MOST presents the first formulation of its kind, temporal clip Banzhaf interaction, which precisely quantifies text-motion coherence at the clip level. This enables direct, fine-grained matching between text and motion clips and eliminates the prevalent redundancy. In the generation stage, a motion prompt module exploits the retrieved motion clips to produce semantically consistent movements. Extensive quantitative and qualitative evaluations show that MOST achieves state-of-the-art text-to-motion retrieval and generation performance, with especially strong results on rare prompts.
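The Banzhaf interaction that MOST builds on originates in cooperative game theory: for a pair of players (here, clips), it averages the pairwise synergy v(S∪{i,j}) − v(S∪{i}) − v(S∪{j}) + v(S) over all coalitions S of the remaining players. The sketch below is a minimal, hedged illustration of that generic index on a toy value function, not the paper's actual clip-level formulation; the payoff table and synergy term are invented for the example.

```python
from itertools import combinations

def banzhaf_interaction(value, players, i, j):
    """Banzhaf interaction index for the pair (i, j).

    `value` maps a frozenset of players (e.g. motion clips) to a scalar
    payoff, such as a text-motion similarity over the selected clips.
    Averages the pairwise synergy over all coalitions S of the others.
    """
    others = [p for p in players if p not in (i, j)]
    total, count = 0.0, 0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            S = frozenset(S)
            total += (value(S | {i, j}) - value(S | {i})
                      - value(S | {j}) + value(S))
            count += 1
    return total / count

# Toy value function (illustrative only): additive per-clip payoffs
# plus an extra synergy when clips 0 and 1 appear together.
base = {0: 1.0, 1: 2.0, 2: 0.5}
def v(S):
    payoff = sum(base[p] for p in S)
    if 0 in S and 1 in S:  # clips 0 and 1 reinforce each other
        payoff += 0.75
    return payoff

print(banzhaf_interaction(v, [0, 1, 2], 0, 1))  # recovers the synergy: 0.75
print(banzhaf_interaction(v, [0, 1, 2], 0, 2))  # no interaction: 0.0
```

On this toy game the index exactly recovers the injected synergy between clips 0 and 1 and reports zero for non-interacting pairs, which is the property that makes it attractive as an interpretable, fine-grained coherence score.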