🤖 AI Summary
This paper addresses end-to-end talking-face synthesis from text without requiring ground-truth audio, aiming to jointly generate natural speech and high-fidelity facial animation while preserving speaker identity. Methodologically, it introduces a shared latent space that jointly models text-derived Wav2Vec2 embeddings (predicted by a Text-to-Vec module) and hierarchical speech latent variables (from HierSpeech++) to co-drive acoustic and visual generation. A two-stage training strategy is employed: first pretraining on features extracted from real speech, then fine-tuning on TTS-predicted features, which mitigates the distribution shift between the two and enforces tight audiovisual synchronization. Experiments demonstrate significant improvements over cascaded approaches in lip-sync accuracy, facial expressiveness, speech quality (e.g., MOS), and visual realism.
📝 Abstract
We propose a text-to-talking-face synthesis framework that leverages latent speech representations from HierSpeech++. A Text-to-Vec module predicts Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle the distribution shift between clean and TTS-predicted features, we adopt a two-stage training scheme: pretraining on Wav2Vec2 embeddings extracted from real speech, then fine-tuning on TTS-predicted embeddings. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech with synchronized facial motion, without requiring ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync accuracy and visual realism.
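The two-stage scheme described above can be sketched as follows. This is a minimal illustrative mock-up, not the paper's code: every function here (`extract_wav2vec2_features`, `text_to_vec`, `generate_speech_and_face`) is a hypothetical placeholder standing in for the real Wav2Vec2 encoder, Text-to-Vec module, and shared-latent generators. The point it shows is the data-source switch: stage 1 conditions the generator on features extracted from ground-truth audio, stage 2 swaps in TTS-predicted features of the same shape, so the generator sees the distribution it will face at inference.

```python
import numpy as np

DIM = 768  # Wav2Vec2-base hidden size

def extract_wav2vec2_features(audio: np.ndarray) -> np.ndarray:
    """Placeholder for a real Wav2Vec2 encoder (frame-level embeddings)."""
    n_frames = len(audio) // 320  # ~20 ms hop at 16 kHz
    return np.random.default_rng(0).standard_normal((n_frames, DIM))

def text_to_vec(text: str, n_frames: int) -> np.ndarray:
    """Placeholder Text-to-Vec: predicts Wav2Vec2-like embeddings from text."""
    return np.random.default_rng(len(text)).standard_normal((n_frames, DIM))

def generate_speech_and_face(latent: np.ndarray):
    """One shared latent co-drives both decoders (dummy stand-ins here)."""
    speech = latent.mean(axis=1)                     # acoustic stream, 1 value/frame
    face = latent @ np.ones((DIM, 3)) / DIM          # 3-dof facial motion/frame
    return speech, face

def training_step(stage: str, text: str, audio: np.ndarray):
    """Stage 'pretrain': features from real audio; stage 'finetune': TTS outputs."""
    if stage == "pretrain":
        latent = extract_wav2vec2_features(audio)
    else:
        latent = text_to_vec(text, n_frames=len(audio) // 320)
    return generate_speech_and_face(latent)

audio = np.zeros(16000)  # 1 s of 16 kHz audio -> 50 frames
s1, f1 = training_step("pretrain", "hello world", audio)
s2, f2 = training_step("finetune", "hello world", audio)
```

Because both stages emit latents of identical shape, the speech and face decoders are untouched by the switch; only the conditioning source changes, which is what lets fine-tuning absorb the clean-vs-predicted distribution gap.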