🤖 AI Summary
This paper addresses end-to-end talking-face synthesis from text without requiring ground-truth audio, aiming to jointly generate natural speech and high-fidelity facial animation while preserving speaker identity. Methodologically, it introduces a shared latent space that jointly models text-derived Wav2Vec2 embeddings (predicted by a Text-to-Vec module) and hierarchical speech latent variables (from HierSpeech++) to co-drive acoustic and visual generation. A two-stage training strategy is employed: first pretraining on features extracted from real speech, then fine-tuning on TTS-predicted features, which mitigates the distribution shift between the two and enforces tight audiovisual synchronization. Experiments demonstrate significant improvements over cascaded approaches in lip-sync accuracy, facial expressiveness, speech quality (e.g., MOS), and visual realism.
📝 Abstract
We propose a text-to-talking-face synthesis framework that leverages latent speech representations from HierSpeech++. A Text-to-Vec module predicts Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle the distribution shift between clean and TTS-predicted features, we adopt a two-stage training scheme: pretraining on Wav2Vec2 embeddings extracted from real speech, then fine-tuning on TTS-predicted embeddings. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech with synchronized facial motion, without requiring ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync accuracy and visual realism.
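The two-stage scheme described above can be sketched as follows. This is a minimal illustrative mock-up, not the paper's code: every function here (`extract_wav2vec2_features`, `text_to_vec`, `generate_speech_and_face`) is a hypothetical placeholder standing in for the real Wav2Vec2 encoder, Text-to-Vec module, and shared-latent generators. The point it shows is the data-source switch: stage 1 conditions the generator on features extracted from ground-truth audio, stage 2 swaps in TTS-predicted features of the same shape, so the generator sees the distribution it will face at inference.

```python
import numpy as np

DIM = 768  # Wav2Vec2-base hidden size

def extract_wav2vec2_features(audio: np.ndarray) -> np.ndarray:
    """Placeholder for a real Wav2Vec2 encoder (frame-level embeddings)."""
    n_frames = len(audio) // 320  # ~20 ms hop at 16 kHz
    return np.random.default_rng(0).standard_normal((n_frames, DIM))

def text_to_vec(text: str, n_frames: int) -> np.ndarray:
    """Placeholder Text-to-Vec: predicts Wav2Vec2-like embeddings from text."""
    return np.random.default_rng(len(text)).standard_normal((n_frames, DIM))

def generate_speech_and_face(latent: np.ndarray):
    """One shared latent co-drives both decoders (dummy stand-ins here)."""
    speech = latent.mean(axis=1)                     # acoustic stream, 1 value/frame
    face = latent @ np.ones((DIM, 3)) / DIM          # 3-dof facial motion/frame
    return speech, face

def training_step(stage: str, text: str, audio: np.ndarray):
    """Stage 'pretrain': features from real audio; stage 'finetune': TTS outputs."""
    if stage == "pretrain":
        latent = extract_wav2vec2_features(audio)
    else:
        latent = text_to_vec(text, n_frames=len(audio) // 320)
    return generate_speech_and_face(latent)

audio = np.zeros(16000)  # 1 s of 16 kHz audio -> 50 frames
s1, f1 = training_step("pretrain", "hello world", audio)
s2, f2 = training_step("finetune", "hello world", audio)
```

Because both stages emit latents of identical shape, the speech and face decoders are untouched by the switch; only the conditioning source changes, which is what lets fine-tuning absorb the clean-vs-predicted distribution gap.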