🤖 AI Summary
Existing approaches predominantly adopt a sequential generation paradigm for speech and gesture, which yields poor multimodal synchronization and weak prosodic alignment and fails to capture the tight coupling inherent in human communication. This paper introduces the first unified text-to-speech-and-gesture joint generation framework: speech and gesture are jointly encoded as interleaved discrete token sequences and modeled autoregressively within a shared backbone, with modality-specific decoders reconstructing each output. The framework supports multi-speaker synthesis, multi-style cloning, and gesture generation driven by speech alone. Comprehensive objective and subjective evaluations show that the synthesized speech matches state-of-the-art quality, while the generated gestures achieve significantly higher naturalness and cross-modal synchronization than unimodal baselines. These results validate the efficacy of joint modeling for coherent, temporally aligned multimodal expressive generation.
📝 Abstract
Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.
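The core design choice described in the abstract, interleaving discrete speech and gesture tokens so a single autoregressive backbone can model both streams, can be illustrated with a minimal sketch. The snippet below is a hypothetical illustration, not Gelina's actual code: the codebook sizes, the shared-vocabulary offset, and the 4:1 interleave ratio are assumptions chosen only for demonstration.

```python
from typing import List

# Hypothetical vocabulary layout: speech and gesture codebooks share one token
# space by offsetting gesture IDs past the end of the speech codebook.
SPEECH_VOCAB_SIZE = 1024
GESTURE_OFFSET = SPEECH_VOCAB_SIZE


def interleave_tokens(speech: List[int], gesture: List[int],
                      speech_per_block: int = 4,
                      gesture_per_block: int = 1) -> List[int]:
    """Merge two discrete token streams into one interleaved sequence.

    Blocks of `speech_per_block` speech tokens alternate with blocks of
    `gesture_per_block` gesture tokens (shifted into the shared vocabulary),
    so one autoregressive model can predict both modalities in temporal order.
    """
    out: List[int] = []
    s, g = 0, 0
    while s < len(speech) or g < len(gesture):
        out.extend(speech[s:s + speech_per_block])
        s += speech_per_block
        out.extend(t + GESTURE_OFFSET for t in gesture[g:g + gesture_per_block])
        g += gesture_per_block
    return out


# Example: a faster speech token stream paired with a slower gesture stream
# (e.g. audio-codec tokens vs. motion-VQ tokens) interleaved at 4:1.
speech_tokens = list(range(8))    # assumed output of a neural audio codec
gesture_tokens = [100, 101]       # assumed output of a motion quantizer
print(interleave_tokens(speech_tokens, gesture_tokens))
# -> [0, 1, 2, 3, 1124, 4, 5, 6, 7, 1125]
```

At inference, the backbone would generate such a mixed sequence token by token; splitting it back by the offset recovers the per-modality streams that the modality-specific decoders then reconstruct into audio and motion.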