SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses semantic inconsistency in speech-driven gesture generation—specifically, the tendency of existing methods to produce only rhythmic beats rather than semantically coherent, contextually appropriate gestures aligned with speech content. We propose a two-stage generative framework: (1) a vector-quantized variational autoencoder (VQ-VAE) models motion priors; and (2) a joint alignment module fuses acoustic, linguistic, and speaker identity features, augmented by a novel multi-level semantic consistency learning mechanism that jointly grounds fine-grained and global semantics—a first in gesture synthesis. Evaluated on two benchmark datasets, our method achieves significant improvements over state-of-the-art approaches in both objective metrics (FID, MSE) and subjective user assessments (naturalness, semantic relevance), markedly enhancing semantic fidelity and interactive naturalness of generated gestures.

📝 Abstract
Creating a virtual avatar with semantically coherent gestures that are aligned with speech is a challenging task. Existing gesture generation research has mainly focused on generating rhythmic beat gestures, neglecting the semantic context of the gestures. In this paper, we propose a novel approach for semantic grounding in co-speech gesture generation that integrates semantic information at both fine-grained and global levels. Our approach starts by learning a motion prior through a vector-quantized variational autoencoder. Building on this model, a second-stage module automatically generates gestures from speech, text-based semantics, and speaker identity, and ensures consistency between the semantic relevance of the generated gestures and the co-occurring speech semantics through semantic coherence and relevance modules. Experimental results demonstrate that our approach enhances the realism and coherence of semantic gestures. Extensive experiments and user studies show that our method outperforms state-of-the-art approaches on two co-speech gesture generation benchmarks in both objective and subjective metrics. Qualitative results, code, the dataset, and pre-trained models can be viewed at https://semgesture.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Generating semantically coherent gestures aligned with speech
Overcoming neglect of semantic context in gesture generation
Ensuring consistency between gestures and speech semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates semantic coherence and relevance learning (see the loss sketch after this list)
Uses vector-quantized variational autoencoder for motion prior
Generates gestures from speech, text, and speaker identity
Lanmiao Liu
Max Planck Institute for Psycholinguistics, Donders Institute for Brain, Cognition and Behaviour, Utrecht University
Esam Ghaleb
Max Planck Institute for Psycholinguistics & Donders Centre for Brain, Cognition and Behaviour
Multimodal Learning, Computer Vision, Behaviour Modelling, Speech and Vision
Aslı Özyürek
Max Planck Institute for Psycholinguistics, Donders Institute for Brain, Cognition and Behaviour
Zerrin Yumak
Utrecht University
Interactive virtual characters, social robots, artificial intelligence, human-computer interaction