🤖 AI Summary
Existing audio-driven talking-head generation methods struggle to reconstruct high-frequency facial details, particularly lip movements, resulting in reduced visual realism. To address this, we build upon Wav2Lip and propose the Space-Optimised Vector Quantised Auto Encoder (SOVQAE), drawing on Lipschitz continuity theory to formally establish the robustness of VQAEs against latent-space noise, which enables temporally consistent reconstruction of high-frequency textures. To rigorously evaluate high-frequency fidelity, we introduce HFTK, the first benchmark dataset explicitly designed for assessing high-frequency facial details in talking-head synthesis. Extensive experiments demonstrate that our method, LaDTalk, achieves state-of-the-art FID and LPIPS scores, improves lip-sync accuracy by 12.7%, and significantly enhances cross-domain generalization.
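As a rough sketch of the Lipschitz argument, in our own notation rather than necessarily the paper's: if the vector quantizer $Q$ snaps a latent to its nearest codebook entry and the decoder $D$ is $L$-Lipschitz, then any latent perturbation smaller than the quantizer's decision margin leaves the assigned code, and hence the decoded output, unchanged:

```latex
% d_1(z), d_2(z): distances from z to its nearest and second-nearest codebook
% entries; the margin m(z) = (d_2(z) - d_1(z)) / 2 is the perturbation radius
% within which the nearest-code assignment cannot flip.
\|\epsilon\| < m(z) \;\Longrightarrow\; Q(z + \epsilon) = Q(z),
\qquad
\|D(z_1) - D(z_2)\| \le L \,\|z_1 - z_2\|.
```

So noise that stays inside a Voronoi cell of the codebook is absorbed entirely by quantization, and even when it is not, the Lipschitz bound caps how much the decoded image can change.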
📝 Abstract
Audio-driven talking head generation is a pivotal area in film-making and Virtual Reality. Although existing methods have made significant strides under the end-to-end paradigm, they still struggle to produce videos with high-frequency details due to their limited expressivity in this domain. This limitation has prompted us to explore an effective post-processing approach for synthesizing photo-realistic talking head videos. Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities. Drawing on the theory of Lipschitz continuity, we theoretically establish the noise robustness of Vector Quantised Auto Encoders (VQAEs). Our experiments further demonstrate that the high-frequency texture deficiency of the foundation model can be recovered in a temporally consistent manner by the Space-Optimised Vector Quantised Auto Encoder (SOVQAE) that we introduce, thereby facilitating the creation of realistic talking head videos. We conduct experiments on both a conventional benchmark and the High-Frequency TalKing head (HFTK) dataset that we curated. The results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
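To make the quantization step behind this robustness claim concrete, below is a minimal PyTorch sketch of nearest-codebook lookup. It is illustrative only: the codebook size, latent dimension, and names such as `quantize` are our assumptions, not the paper's implementation.

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Snap each latent vector to its nearest codebook entry (L2 distance).

    z:        (N, D) batch of latent vectors
    codebook: (K, D) codebook; K and D are assumed, not taken from the paper
    """
    # Pairwise L2 distances between latents and codebook entries: shape (N, K)
    dists = torch.cdist(z, codebook)
    idx = dists.argmin(dim=1)        # index of the nearest code per latent
    return codebook[idx], idx

# Illustration of noise robustness: a perturbation smaller than the
# quantizer's decision margin leaves the assigned code unchanged.
torch.manual_seed(0)
codebook = torch.randn(512, 64)                  # 512 codes of dim 64 (assumed)
z = codebook[10] + 0.01 * torch.randn(1, 64)     # latent near code 10
noisy_z = z + 0.01 * torch.randn(1, 64)          # small latent-space noise
_, idx_clean = quantize(z, codebook)
_, idx_noisy = quantize(noisy_z, codebook)
print(idx_clean.item(), idx_noisy.item())        # typically the same code index
```

Because both the clean and the perturbed latent fall inside the same Voronoi cell of the codebook, they decode to identical outputs, which is the intuition behind recovering high-frequency textures consistently across frames.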