🤖 AI Summary
Existing audio-driven talking-head generation methods struggle to reconstruct high-frequency facial details, particularly lip movements, resulting in reduced visual realism. To address this, we build upon Wav2Lip and propose the Space-Optimised Vector Quantised Auto Encoder (SOVQAE), drawing on Lipschitz continuity theory to formally establish the robustness of VQAEs against latent-space noise, which enables temporally consistent reconstruction of high-frequency textures. To rigorously evaluate high-frequency fidelity, we introduce HFTK, the first benchmark dataset explicitly designed for assessing high-frequency facial details in talking-head synthesis. Extensive experiments demonstrate that our method, LaDTalk, achieves state-of-the-art FID and LPIPS scores, improves lip-sync accuracy by 12.7%, and significantly enhances cross-domain generalization.
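As a rough sketch of the Lipschitz argument, in our own notation rather than necessarily the paper's: if the vector quantizer $Q$ snaps a latent to its nearest codebook entry and the decoder $D$ is $L$-Lipschitz, then any latent perturbation smaller than the quantizer's decision margin leaves the assigned code, and hence the decoded output, unchanged:

```latex
% d_1(z), d_2(z): distances from z to its nearest and second-nearest codebook
% entries; the margin m(z) = (d_2(z) - d_1(z)) / 2 is the perturbation radius
% within which the nearest-code assignment cannot flip.
\|\epsilon\| < m(z) \;\Longrightarrow\; Q(z + \epsilon) = Q(z),
\qquad
\|D(z_1) - D(z_2)\| \le L \,\|z_1 - z_2\|.
```

So noise that stays inside a Voronoi cell of the codebook is absorbed entirely by quantization, and even when it is not, the Lipschitz bound caps how much the decoded image can change.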
📝 Abstract
Audio-driven talking head generation is a pivotal area in film-making and Virtual Reality. Although existing methods have made significant strides under the end-to-end paradigm, they still struggle to produce videos with high-frequency details due to their limited expressivity in this domain. This limitation has prompted us to explore an effective post-processing approach for synthesizing photo-realistic talking head videos. Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities. Drawing on the theory of Lipschitz continuity, we theoretically establish the noise robustness of Vector Quantised Auto Encoders (VQAEs). Our experiments further demonstrate that the high-frequency texture deficiency of the foundation model can be recovered in a temporally consistent manner by the Space-Optimised Vector Quantised Auto Encoder (SOVQAE) that we introduce, thereby facilitating the creation of realistic talking head videos. We conduct experiments on both a conventional benchmark and the High-Frequency TalKing head (HFTK) dataset that we curated. The results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
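To make the quantization step behind this robustness claim concrete, below is a minimal PyTorch sketch of nearest-codebook lookup. It is illustrative only: the codebook size, latent dimension, and names such as `quantize` are our assumptions, not the paper's implementation.

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """Snap each latent vector to its nearest codebook entry (L2 distance).

    z:        (N, D) batch of latent vectors
    codebook: (K, D) codebook; K and D are assumed, not taken from the paper
    """
    # Pairwise L2 distances between latents and codebook entries: shape (N, K)
    dists = torch.cdist(z, codebook)
    idx = dists.argmin(dim=1)        # index of the nearest code per latent
    return codebook[idx], idx

# Illustration of noise robustness: a perturbation smaller than the
# quantizer's decision margin leaves the assigned code unchanged.
torch.manual_seed(0)
codebook = torch.randn(512, 64)                  # 512 codes of dim 64 (assumed)
z = codebook[10] + 0.01 * torch.randn(1, 64)     # latent near code 10
noisy_z = z + 0.01 * torch.randn(1, 64)          # small latent-space noise
_, idx_clean = quantize(z, codebook)
_, idx_noisy = quantize(noisy_z, codebook)
print(idx_clean.item(), idx_noisy.item())        # typically the same code index
```

Because both the clean and the perturbed latent fall inside the same Voronoi cell of the codebook, they decode to identical outputs, which is the intuition behind recovering high-frequency textures consistently across frames.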