🤖 AI Summary
This work addresses the challenge of jointly modeling rhythmic motion and semantically salient gestures in speech-driven full-body gesture generation. We propose a hierarchical dual-path architecture: one path enforces temporal rhythm alignment via a contrastive rhythm consistency loss; the other employs a semantic-aware sparse motion generator with a learnable semantic scoring mechanism that explicitly models and weights semantic drivers at the frame level. We further introduce an adaptive fusion module that dynamically coordinates rhythm and semantics. Evaluated on the BEAT and TWH benchmarks, our method improves on state-of-the-art approaches, reducing FID by 18.3% and MM-Dist by 12.7%. Qualitative analysis confirms enhanced accuracy and naturalness in semantically critical frames, such as deictic and negation gestures.
📝 Abstract
Good co-speech motion generation cannot be achieved without carefully integrating common rhythmic motion with rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to learn general motions and sparse motions separately, and then fuse them adaptively. In particular, rhythmic consistency learning is explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state-of-the-art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.
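The adaptive synthesis step described above can be illustrated with a minimal sketch. It is not the SemTalk implementation: it assumes that base motion and sparse motion are per-frame pose arrays of the same shape, that the learned semantic score is a value in [0, 1] per frame, and that fusion is a simple per-frame linear blend. The function name `fuse_motion` and the `threshold` parameter are hypothetical, introduced only for illustration; the actual method may fuse in a latent or token space instead.

```python
import numpy as np


def fuse_motion(base, sparse, score, threshold=0.5):
    """Blend sparse semantic motion into base motion, frame by frame.

    base, sparse: (T, D) arrays of per-frame pose features (assumed layout).
    score: (T,) array of semantic scores in [0, 1] (assumed range).
    threshold: hypothetical cutoff; frames scoring below it keep pure base motion.
    """
    base = np.asarray(base, dtype=float)
    sparse = np.asarray(sparse, dtype=float)
    score = np.asarray(score, dtype=float)
    if base.shape != sparse.shape or score.shape != base.shape[:1]:
        raise ValueError("expected base/sparse of shape (T, D) and score of shape (T,)")

    # Zero out low scores so that only semantically salient frames are modified.
    weight = np.where(score >= threshold, score, 0.0)[:, None]

    # Per-frame convex combination of rhythmic base motion and sparse semantic motion.
    return (1.0 - weight) * base + weight * sparse


# Toy usage: 4 frames, 3 pose dimensions.
base = np.zeros((4, 3))
sparse = np.ones((4, 3))
score = np.array([0.1, 0.5, 0.9, 1.0])
fused = fuse_motion(base, sparse, score)
```

In this toy setting the first frame stays on the base motion because its score falls below the cutoff, while the remaining frames move toward the sparse motion in proportion to their scores. Thresholding keeps the semantic contribution sparse, which mirrors the idea that semantic gestures are rare relative to rhythmic motion.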