🤖 AI Summary
Existing approaches predominantly adopt a sequential generation paradigm for speech and gesture, which yields poor multimodal synchronization and weak prosodic alignment and fails to capture the tight coupling inherent in human communication. This paper introduces the first unified text-to-speech-and-gesture joint generation framework: speech and gesture are jointly encoded as interleaved discrete token sequences and modeled autoregressively within a shared backbone, with modality-specific decoders reconstructing each output. The framework supports multi-speaker synthesis, multi-style cloning, and gesture generation driven by speech alone. Comprehensive objective and subjective evaluations show that the synthesized speech matches state-of-the-art quality, while the generated gestures achieve significantly higher naturalness and cross-modal synchronization than unimodal baselines. These results validate the efficacy of joint modeling for coherent, temporally aligned multimodal expressive generation.
📝 Abstract
Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.
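The core design choice described in the abstract, interleaving discrete speech and gesture tokens so a single autoregressive backbone can model both streams, can be illustrated with a minimal sketch. The snippet below is a hypothetical illustration, not Gelina's actual code: the codebook sizes, the shared-vocabulary offset, and the 4:1 interleave ratio are assumptions chosen only for demonstration.

```python
from typing import List

# Hypothetical vocabulary layout: speech and gesture codebooks share one token
# space by offsetting gesture IDs past the end of the speech codebook.
SPEECH_VOCAB_SIZE = 1024
GESTURE_OFFSET = SPEECH_VOCAB_SIZE


def interleave_tokens(speech: List[int], gesture: List[int],
                      speech_per_block: int = 4,
                      gesture_per_block: int = 1) -> List[int]:
    """Merge two discrete token streams into one interleaved sequence.

    Blocks of `speech_per_block` speech tokens alternate with blocks of
    `gesture_per_block` gesture tokens (shifted into the shared vocabulary),
    so one autoregressive model can predict both modalities in temporal order.
    """
    out: List[int] = []
    s, g = 0, 0
    while s < len(speech) or g < len(gesture):
        out.extend(speech[s:s + speech_per_block])
        s += speech_per_block
        out.extend(t + GESTURE_OFFSET for t in gesture[g:g + gesture_per_block])
        g += gesture_per_block
    return out


# Example: a faster speech token stream paired with a slower gesture stream
# (e.g. audio-codec tokens vs. motion-VQ tokens) interleaved at 4:1.
speech_tokens = list(range(8))    # assumed output of a neural audio codec
gesture_tokens = [100, 101]       # assumed output of a motion quantizer
print(interleave_tokens(speech_tokens, gesture_tokens))
# -> [0, 1, 2, 3, 1124, 4, 5, 6, 7, 1125]
```

At inference, the backbone would generate such a mixed sequence token by token; splitting it back by the offset recovers the per-modality streams that the modality-specific decoders then reconstruct into audio and motion.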