SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses semantic inconsistency in speech-driven gesture generation—specifically, the tendency of existing methods to produce only rhythmic beats rather than semantically coherent, contextually appropriate gestures aligned with speech content. We propose a two-stage generative framework: (1) a vector-quantized variational autoencoder (VQ-VAE) models motion priors; and (2) a joint alignment module fuses acoustic, linguistic, and speaker identity features, augmented by a novel multi-level semantic consistency learning mechanism that jointly grounds fine-grained and global semantics—a first in gesture synthesis. Evaluated on two benchmark datasets, our method achieves significant improvements over state-of-the-art approaches in both objective metrics (FID, MSE) and subjective user assessments (naturalness, semantic relevance), markedly enhancing semantic fidelity and interactive naturalness of generated gestures.

📝 Abstract
Creating a virtual avatar with semantically coherent gestures that are aligned with speech is a challenging task. Existing gesture generation research has mainly focused on generating rhythmic beat gestures, neglecting the semantic context of the gestures. In this paper, we propose a novel approach for semantic grounding in co-speech gesture generation that integrates semantic information at both fine-grained and global levels. Our approach starts by learning a motion prior through a vector-quantized variational autoencoder. Building on this model, a second-stage module automatically generates gestures from speech, text-based semantics, and speaker identity, and ensures consistency between the semantic relevance of the generated gestures and the co-occurring speech semantics through semantic coherence and relevance modules. Experimental results demonstrate that our approach enhances the realism and coherence of semantic gestures. Extensive experiments and user studies show that our method outperforms state-of-the-art approaches on two co-speech gesture generation benchmarks in both objective and subjective metrics. Qualitative results, code, the dataset, and pre-trained models can be viewed at https://semgesture.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Generating semantically coherent gestures aligned with speech
Overcoming neglect of semantic context in gesture generation
Ensuring consistency between gestures and speech semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates semantic coherence and relevance learning (see the loss sketch after this list)
Uses vector-quantized variational autoencoder for motion prior
Generates gestures from speech, text, and speaker identity
Lanmiao Liu
Max Planck Institute for Psycholinguistics, Donders Institute for Brain, Cognition and Behaviour, Utrecht University
Esam Ghaleb
Max Planck Institute for Psycholinguistics & Donders Centre for Brain, Cognition and Behaviour
Multimodal Learning, Computer Vision, Behaviour Modelling, Speech and Vision
Aslı Özyürek
Max Planck Institute for Psycholinguistics, Donders Institute for Brain, Cognition and Behaviour
Zerrin Yumak
Utrecht University
Interactive virtual characters, social robots, artificial intelligence, human-computer interaction