LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details

📅 2024-10-01
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing audio-driven talking-head generation methods struggle to reconstruct high-frequency facial details, particularly lip texture, which reduces visual realism. To address this, we build on a pretrained Wav2Lip foundation model and propose the Space-Optimised Vector Quantised Auto Encoder (SOVQAE); drawing on Lipschitz continuity theory, we formally establish the robustness of VQAEs to latent-space noise, enabling temporally consistent recovery of high-frequency textures. To evaluate high-frequency fidelity, we curate the High-Frequency TalKing head (HFTK) benchmark dataset, designed for assessing high-frequency facial detail in talking-head synthesis. Extensive experiments show that our method achieves state-of-the-art FID and LPIPS, improves lip-sync accuracy by 12.7%, and significantly strengthens cross-domain generalization.

📝 Abstract
Audio-driven talking head generation is a pivotal area within film-making and Virtual Reality. Although existing methods have made significant strides following the end-to-end paradigm, they still encounter challenges in producing videos with high-frequency details due to their limited expressivity in this domain. This limitation has prompted us to explore an effective post-processing approach to synthesize photo-realistic talking head videos. Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities. Drawing on the theory of Lipschitz Continuity, we have theoretically established the noise robustness of Vector Quantised Auto Encoders (VQAEs). Our experiments further demonstrate that the high-frequency texture deficiency of the foundation model can be temporally consistently recovered by the Space-Optimised Vector Quantised Auto Encoder (SOVQAE) we introduced, thereby facilitating the creation of realistic talking head videos. We conduct experiments on both the conventional dataset and the High-Frequency TalKing head (HFTK) dataset that we curated. The results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
Problem

Research questions and friction points this paper is trying to address.

Enhancing high-frequency details in audio-driven talking head videos
Improving noise robustness in Vector Quantised Auto Encoders
Achieving state-of-the-art video quality and lip synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pretrained Wav2Lip for audio-lip alignment
Employs SOVQAE for high-frequency detail recovery
Leverages Lipschitz Continuity for noise robustness
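The noise-robustness claim above rests on a simple property of vector quantization: as long as a latent perturbation is smaller than half the gap to the nearest other codebook entry, nearest-neighbour lookup maps the noisy latent to the same code as the clean one. A minimal sketch of this idea (illustrative only, not the authors' SOVQAE implementation; the codebook and latents here are synthetic):

```python
# Hypothetical sketch: nearest-neighbour vector quantization, showing why a
# VQ bottleneck absorbs small latent-space noise. Not the paper's code.
import numpy as np

def quantize(z, codebook):
    """Map each latent vector in z to the index of its nearest codebook entry."""
    # Pairwise squared distances between latents (N, D) and codes (K, D).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))       # 16 codes, 8-dim latents (toy sizes)
z = codebook[[3, 7, 11]]                  # clean latents sit exactly on codes
noise = 1e-3 * rng.normal(size=z.shape)   # small latent perturbation

clean_idx = quantize(z, codebook)
noisy_idx = quantize(z + noise, codebook)
assert (clean_idx == noisy_idx).all()     # quantization absorbs the noise
```

Because the quantization step is piecewise constant with bounded sensitivity to input perturbations, noise injected into the latent space cannot change the decoded output beyond the resolution of the codebook, which is the intuition behind the Lipschitz-continuity argument the paper formalizes.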
Jian Yang
Psyche AI Inc.
Xukun Wang
Psyche AI Inc.
Wentao Wang
The University of Alabama at Birmingham
Guoming Li
Psyche AI Inc.
Qihang Fang
Psyche AI Inc.
Ruihong Yuan
Psyche AI Inc.
Tianyang Wang
University of Alabama at Birmingham
machine learning (deep learning), computer vision
Zhaoxin Fan
Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Institute of Artificial Intelligence, Beihang University, Beijing, China; Beijing Academy of Blockchain and Edge Computing, China