Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in speech-driven gesture video synthesis: poor rhythmic synchronization between speech and gestures, and inconsistent modeling of multi-joint motion. We propose a two-stage diffusion-based paradigm: first, an audio-to-motion diffusion model (Cosh-DiT-A) generates temporally aligned upper-body, facial, and hand motion sequences; second, a motion-to-video diffusion model (Cosh-DiT-V) synthesizes high-fidelity videos from that motion. Our core innovation is a hybrid Diffusion Transformer design that couples discrete VQ-VAE motion representations with continuous diffusion modeling, enabling unified learning of multi-joint motion priors and cross-modal temporal alignment. Extensive evaluations report state-of-the-art results on objective metrics (synchronization error, motion naturalness, and expressiveness) and in subjective human assessments, with clear gains in gesture naturalness, facial expressivity, and speech-motion synchronization.
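To make the two-stage paradigm concrete, below is a minimal sketch of the inference flow: an audio-to-motion model followed by a motion-to-video model. All class names, layer choices, and tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of the two-stage Cosh-DiT pipeline described above.
# Stage 1 (Cosh-DiT-A, assumed interface): audio features -> motion sequence.
# Stage 2 (Cosh-DiT-V, assumed interface): motion sequence -> video frames.
import torch
import torch.nn as nn


class CoshDiTA(nn.Module):
    """Audio-to-motion stage (assumed interface)."""

    def __init__(self, audio_dim=128, motion_dim=256, hidden=512):
        super().__init__()
        self.encode = nn.Linear(audio_dim, hidden)
        self.denoiser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.to_motion = nn.Linear(hidden, motion_dim)

    def forward(self, audio_feats):            # (B, T, audio_dim)
        h = self.denoiser(self.encode(audio_feats))
        return self.to_motion(h)               # (B, T, motion_dim) motion latents


class CoshDiTV(nn.Module):
    """Motion-to-video stage (assumed interface)."""

    def __init__(self, motion_dim=256, frame_ch=3, size=64):
        super().__init__()
        self.size = size
        self.to_frame = nn.Linear(motion_dim, frame_ch * size * size)

    def forward(self, motion):                 # (B, T, motion_dim)
        B, T, _ = motion.shape
        return self.to_frame(motion).view(B, T, 3, self.size, self.size)


if __name__ == "__main__":
    audio = torch.randn(1, 50, 128)            # ~2 s of audio features at 25 fps
    motion = CoshDiTA()(audio)                 # stage 1: speech -> motion
    video = CoshDiTV()(motion)                 # stage 2: motion -> frames
    print(motion.shape, video.shape)
```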

📝 Abstract
Co-speech gesture video synthesis is a challenging task that requires both probabilistic modeling of human gestures and the synthesis of realistic images that align with the rhythmic nuances of speech. To address these challenges, we propose Cosh-DiT, a Co-speech gesture video system with hybrid Diffusion Transformers that perform audio-to-motion and motion-to-video synthesis using discrete and continuous diffusion modeling, respectively. First, we introduce an audio Diffusion Transformer (Cosh-DiT-A) to synthesize expressive gesture dynamics synchronized with speech rhythms. To capture upper body, facial, and hand movement priors, we employ vector-quantized variational autoencoders (VQ-VAEs) to jointly learn their dependencies within a discrete latent space. Then, for realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer (Cosh-DiT-V) that effectively integrates spatial and temporal contexts. Extensive experiments demonstrate that our framework consistently generates lifelike videos with expressive facial expressions and natural, smooth gestures that align seamlessly with speech.
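The discrete motion prior described in the abstract rests on vector quantization: continuous upper-body, facial, and hand features are snapped to the nearest entry of a learned codebook, yielding motion tokens that the audio Diffusion Transformer can model. The sketch below shows a generic VQ-VAE-style quantizer with a straight-through gradient estimator; the codebook size, feature dimension, and shapes are assumptions for illustration, not values from the paper.

```python
# Generic VQ-VAE-style quantizer for a joint motion feature (assumed sizes).
import torch
import torch.nn as nn


class MotionQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # (B, T, dim) continuous motion feature
        B, T, D = z.shape
        flat = z.reshape(B * T, D)
        # Nearest-neighbour lookup in the codebook for every time step.
        dists = torch.cdist(flat, self.codebook.weight)    # (B*T, num_codes)
        idx = dists.argmin(dim=-1).view(B, T)              # discrete motion tokens
        z_q = self.codebook(idx)                           # (B, T, dim) quantized motion
        # Straight-through estimator so gradients reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx


if __name__ == "__main__":
    z = torch.randn(1, 50, 256)                            # encoded body/face/hand feature
    z_q, tokens = MotionQuantizer()(z)
    print(z_q.shape, tokens.shape)                         # (1, 50, 256), (1, 50)
```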
Problem

Research questions and friction points this paper is trying to address.

Synthesize co-speech gesture videos with realistic human motion.
Align gesture dynamics with speech rhythms using audio-visual diffusion.
Generate lifelike videos with expressive facial and hand movements.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Diffusion Transformers for gesture synthesis
VQ-VAEs for learning movement dependencies
Spatial-temporal integration in video synthesis (see the sketch after this list)
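One common way to realize spatial-temporal integration in a visual Diffusion Transformer is factorized attention: attend over spatial tokens within each frame, then over time at each token location. The block below is a generic sketch of that pattern under assumed dimensions, not the exact Cosh-DiT-V architecture.

```python
# Illustrative factorized spatial-temporal attention block (assumed design).
import torch
import torch.nn as nn


class SpatialTemporalBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # (B, T, N, C): frames x spatial tokens
        B, T, N, C = x.shape
        # Spatial attention: mix tokens within each frame.
        s = x.reshape(B * T, N, C)
        s_n = self.norm1(s)
        s = s + self.spatial_attn(s_n, s_n, s_n)[0]
        # Temporal attention: mix the same token location across frames.
        t = s.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        t_n = self.norm2(t)
        t = t + self.temporal_attn(t_n, t_n, t_n)[0]
        return t.reshape(B, N, T, C).permute(0, 2, 1, 3)   # back to (B, T, N, C)


if __name__ == "__main__":
    tokens = torch.randn(2, 8, 16, 256)        # 8 frames, 16 spatial tokens each
    out = SpatialTemporalBlock()(tokens)
    print(out.shape)                           # torch.Size([2, 8, 16, 256])
```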
🔎 Similar Papers
No similar papers found.