Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis

📅 2025-11-07
🤖 AI Summary
This paper addresses the challenge of end-to-end talking-face synthesis from text without requiring ground-truth audio, aiming to jointly generate natural speech and high-fidelity facial animation while preserving speaker identity. Methodologically, it introduces a shared latent space that jointly models Wav2Vec2-based text embeddings (from Text-to-Vec) and hierarchical speech latent variables (from HierSpeech++) to co-drive acoustic and visual generation. A two-stage training strategy is employed—first pretraining on real speech features, then fine-tuning on TTS-predicted features—to mitigate distribution shift and enforce tight audiovisual synchronization. Experiments demonstrate significant improvements over cascaded approaches, achieving superior lip-sync accuracy, facial expressiveness, speech quality (e.g., MOS), and visual realism.
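The two-stage strategy described above (pretrain on features from real speech, then fine-tune on TTS-predicted features to absorb the distribution shift) can be sketched with a toy model. This is purely illustrative and not the authors' code: the linear "generator", the feature values, and all function names are invented stand-ins for the real audio-visual generator and Wav2Vec2/HierSpeech++ features.

```python
def train_step(weights, features, targets, lr=0.01):
    """One gradient step of a toy linear generator (stand-in for the real model)."""
    preds = [w * f for w, f in zip(weights, features)]
    # Gradient of squared error (w*f - t)^2 with respect to w is 2*(w*f - t)*f.
    grads = [2 * (p - t) * f for p, t, f in zip(preds, targets, features)]
    return [w - lr * g for w, g in zip(weights, grads)]

def two_stage_training(weights, real_feats, tts_feats, targets,
                       pretrain_steps=100, finetune_steps=50):
    # Stage 1: pretrain on features extracted from ground-truth speech.
    for _ in range(pretrain_steps):
        weights = train_step(weights, real_feats, targets)
    # Stage 2: fine-tune on TTS-predicted features, which are slightly
    # distribution-shifted relative to the real ones.
    for _ in range(finetune_steps):
        weights = train_step(weights, tts_feats, targets)
    return weights

# TTS-predicted features (1.1, 0.9) are deliberately perturbed versions of
# the "real" features (1.0, 1.0) to mimic the distribution shift.
final = two_stage_training([0.0, 0.0], [1.0, 1.0], [1.1, 0.9], [2.0, 2.0])
```

The point of the schedule is that stage 2 starts from weights already close to the solution, so the model only has to adapt to the shifted inputs rather than learn from scratch.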

📝 Abstract
We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS-predicted features, we adopt a two-stage training: pretraining on Wav2Vec2 embeddings and fine-tuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
Problem

Research questions and friction points this paper is trying to address.

Generating synchronized speech and facial motion directly from text
Handling the distribution shift between clean and TTS-synthesized features
Improving lip-sync accuracy and visual realism without ground-truth audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging latent speech representations from HierSpeech++
Joint conditioning of speech and face generation on Wav2Vec2 embeddings predicted from text
Two-stage training that mitigates the distribution shift between real and TTS-predicted features
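The joint-conditioning idea in the bullets above can be sketched as a single shared latent driving both output branches. This is a minimal illustration under invented names: `text_to_vec`, `speech_decoder`, and `face_decoder` are toy stand-ins, not the paper's modules, and the "features" are arbitrary numbers.

```python
def text_to_vec(text):
    """Toy stand-in for the Text-to-Vec module: map characters to a latent sequence."""
    return [ord(c) % 7 / 7.0 for c in text]

def speech_decoder(latent):
    # Pseudo acoustic frames derived from the shared latent.
    return [round(z * 2, 3) for z in latent]

def face_decoder(latent):
    # Pseudo facial-motion parameters derived from the same latent.
    return [round(z + 0.5, 3) for z in latent]

def synthesize(text):
    z = text_to_vec(text)  # one shared latent representation
    # Both branches consume the same latent, so the audio and visual
    # streams are frame-aligned by construction.
    return speech_decoder(z), face_decoder(z)
```

Because both decoders read the same latent sequence, synchronization falls out of the architecture rather than being enforced after the fact, which is the contrast the paper draws against cascaded pipelines.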
Dogucan Yaman
Karlsruhe Institute of Technology
Seymanur Akti
Karlsruhe Institute of Technology
Fevziye Irem Eyiokur
Karlsruhe Institute of Technology
Alexander Waibel
Carnegie Mellon University, Karlsruhe Institute of Technology
Machine Learning · Neural Networks · Speech Translation · Multimodal Interfaces