SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

📅 2024-12-21

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

This work addresses the challenge of jointly modeling rhythmic motion and semantically salient gestures in speech-driven full-body gesture generation. We propose a hierarchical dual-path architecture: one path enforces temporal rhythm alignment via a contrastive rhythm consistency loss; the other employs a semantic-aware sparse motion generator, incorporating a learnable semantic scoring mechanism to explicitly model and weight semantic drivers at the frame level. We further introduce a novel adaptive fusion module that dynamically coordinates rhythm and semantics. Evaluated on BEAT and TWH benchmarks, our method achieves significant improvements over state-of-the-art approaches—reducing FID by 18.3% and MM-Dist by 12.7%. Qualitative analysis confirms enhanced accuracy and naturalness in semantically critical frames, such as deictic and negation gestures.

Technology Category

Application Category

📝 Abstract

A good co-speech motion generation cannot be achieved without a careful integration of common rhythmic motion and rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to separately learn general motions and sparse motions, and then adaptively fuse them. In particular, rhythmic consistency learning is explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, textit{semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state-of-the-art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.

Problem

Research questions and friction points this paper is trying to address.

Gesture Generation

Speech Synchronization

Semantic Emphasis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Gestures

Rhythm-Semantic Fusion

Semantic Score Synthesis

🔎 Similar Papers

LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning