MOST: Motion Diffusion Model for Rare Text via Temporal Clip Banzhaf Interaction

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-motion generation methods exhibit coarse-grained semantic alignment and insufficient exploitation of critical temporal cues—especially under rare-language prompts—leading to motion redundancy and semantic inconsistency. Method: We propose a fine-grained, temporal clip-level text–motion alignment framework. Its core innovation is the Temporal Clip Banzhaf Interaction mechanism, the first to enable interpretable, granular semantic matching between natural language descriptions and motion segments. We integrate this with a motion diffusion model augmented by a motion prompting module and a retrieval-enhanced fine-grained alignment strategy. Contribution/Results: Our method achieves state-of-the-art performance on both text-to-motion retrieval and generation benchmarks, with particularly significant gains under rare-text prompts. Ablations confirm that clip-level cross-modal interaction is essential for enforcing semantic consistency across modalities.
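The Banzhaf interaction the summary refers to comes from cooperative game theory: a player's Banzhaf value is its average marginal contribution over all coalitions of the other players. A minimal sketch of that computation is below, with motion clips as players. The characteristic function `v` (an additive sum of per-clip text-similarity scores) and the `clip_sim` values are toy stand-ins for illustration, not the paper's actual formulation.

```python
from itertools import combinations

def banzhaf_values(players, v):
    """Banzhaf value of each player: its average marginal contribution
    v(S ∪ {p}) - v(S) over all coalitions S of the remaining players."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(len(others) + 1):
            for coalition in combinations(others, r):
                total += v(set(coalition) | {p}) - v(set(coalition))
        # 2^(n-1) coalitions of the other players
        values[p] = total / 2 ** (n - 1)
    return values

# Hypothetical per-clip similarity scores to a text prompt.
clip_sim = {"clip0": 0.9, "clip1": 0.2, "clip2": 0.7}

def v(coalition):
    # Toy additive game: a coalition is worth the sum of its clips' scores.
    return sum(clip_sim[c] for c in coalition)

print(banzhaf_values(list(clip_sim), v))
```

In this additive toy game each clip's Banzhaf value equals its own similarity score; the interesting (non-additive) case, which the paper targets, is when `v` captures how clips jointly match a prompt, so redundant clips earn low marginal contributions.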

📝 Abstract
We introduce MOST, a novel motion diffusion model via temporal clip Banzhaf interaction, aimed at addressing the persistent challenge of generating human motion from rare language prompts. While previous approaches struggle with coarse-grained matching and overlook important semantic cues due to motion redundancy, our key insight lies in leveraging fine-grained clip relationships to mitigate these issues. MOST's retrieval stage presents the first formulation of its kind, temporal clip Banzhaf interaction, which precisely quantifies textual-motion coherence at the clip level. This facilitates direct, fine-grained text-to-motion clip matching and eliminates prevalent redundancy. In the generation stage, a motion prompt module effectively utilizes retrieved motion clips to produce semantically consistent movements. Extensive evaluations confirm that MOST achieves state-of-the-art text-to-motion retrieval and generation performance by comprehensively addressing previous challenges, as demonstrated through quantitative and qualitative results highlighting its effectiveness, especially for rare prompts.
Problem

Research questions and friction points this paper is trying to address.

Generating human motion from rare text prompts
Addressing coarse-grained matching and semantic cue issues
Improving text-to-motion retrieval and generation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal clip Banzhaf interaction for fine-grained matching
Motion prompt module utilizing retrieved clips
State-of-the-art text-to-motion retrieval and generation
Yin Wang
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Mu Li
State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
Zhiying Leng
Beihang University | Technische Universität München
Hand Pose Estimation · Graph Neural Network · Semantic Segmentation
Frederick W. B. Li
Department of Computer Science, University of Durham, Durham, UK
Xiaohui Liang
University of Massachusetts Boston
Mobile Healthcare · Voice Technology · Internet of Things · Privacy