Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited use of hand gestures for regulating prosody in current text-to-speech systems, a gap that prevents synthesized speech from reproducing the natural co-speech synchrony observed in human communication. To bridge this gap, the authors propose Gesture2Speech, a framework that, to their knowledge, is the first to introduce hand motion as an explicit prosodic control signal in neural speech synthesis. The approach incorporates a gesture–speech alignment loss to enforce fine-grained temporal synchronization and employs a multimodal mixture-of-experts architecture that fuses linguistic and gestural features within a style extraction module. A large language model–driven speech decoder then generates expressive utterances aligned with the input gestures. Experiments on the PATS dataset demonstrate that the proposed method outperforms state-of-the-art baselines in both speech naturalness and gesture–speech synchrony.
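The mixture-of-experts fusion described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the function name, expert count, and feature dimensions are illustrative assumptions. Frame-aligned linguistic and gesture features are concatenated, a gating network assigns per-frame expert weights, and the experts' outputs are mixed into a single style representation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_fuse(text_feat, gesture_feat, expert_ws, gate_w):
    """Toy mixture-of-experts fusion of linguistic and gestural features.

    text_feat, gesture_feat: (T, d) frame-aligned feature streams.
    expert_ws: list of (2d, d) weight matrices, one per expert.
    gate_w: (2d, E) gating weights over the E experts.
    """
    x = np.concatenate([text_feat, gesture_feat], axis=-1)  # (T, 2d)
    gates = softmax(x @ gate_w, axis=-1)                    # (T, E) per-frame mix
    expert_out = np.stack([np.tanh(x @ w) for w in expert_ws], axis=1)  # (T, E, d)
    return (gates[..., None] * expert_out).sum(axis=1)      # (T, d) fused style

T, d, E = 6, 8, 3
text = rng.normal(size=(T, d))
gest = rng.normal(size=(T, d))
experts = [rng.normal(scale=0.1, size=(2 * d, d)) for _ in range(E)]
gate = rng.normal(scale=0.1, size=(2 * d, E))
style = moe_fuse(text, gest, experts, gate)
print(style.shape)  # (6, 8)
```

In the paper's framework this fused representation would condition the LLM-based speech decoder; here it is simply a per-frame weighted mixture of expert projections.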

📝 Abstract
Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are available at https://research.sri-media-analysis.com/aaai26-beeu-gesture2speech/
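One way to read the gesture-speech alignment loss described in the abstract is as a frame-level contrastive objective: the time-matched gesture/prosody pair is the positive, and all other frames in the utterance are negatives. The sketch below is a plausible instantiation in numpy, not the paper's actual loss; the cosine-similarity formulation and temperature are assumptions:

```python
import numpy as np

def alignment_loss(gesture_emb, prosody_emb, temperature=0.1):
    """Contrastive frame-level gesture-speech alignment loss (illustrative).

    Pushes the (T, T) similarity matrix between the two embedding
    streams toward its diagonal, i.e. each gesture frame should be
    most similar to the prosody frame at the same time step.
    """
    g = gesture_emb / np.linalg.norm(gesture_emb, axis=-1, keepdims=True)
    p = prosody_emb / np.linalg.norm(prosody_emb, axis=-1, keepdims=True)
    logits = (g @ p.T) / temperature                     # (T, T) scaled cosines
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # cross-entropy to diagonal

rng = np.random.default_rng(1)
T, d = 5, 16
emb = rng.normal(size=(T, d))
aligned_loss = alignment_loss(emb, emb)                  # identical streams
misaligned_loss = alignment_loss(emb, rng.normal(size=(T, d)))
print(aligned_loss < misaligned_loss)
```

Perfectly aligned streams yield a near-zero loss, while an unrelated stream scores close to the uniform-guess baseline of log T, which is the behavior one would want from a synchrony objective.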

Problem

Research questions and friction points this paper is trying to address.

hand gestures
speech prosody
multimodal TTS
gesture-speech synchrony
expressive speech synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

gesture-driven prosody
multimodal TTS
Mixture-of-Experts
temporal alignment
speech synthesis