🤖 AI Summary
Diffusion models for speech-driven talking-head video generation achieve high image fidelity but lag behind GANs in lip-sync accuracy, and struggle to jointly optimize temporal alignment and visual quality. To close this gap, the paper proposes SyncDiff, which conditions the diffusion process jointly on a bottlenecked temporal pose frame and on AV-HuBERT speech embeddings, explicitly strengthening cross-modal temporal alignment during denoising while preserving fine-grained visual detail and enforcing precise lip-motion synchronization. Evaluated on LRS2 and LRS3, the method achieves relative improvements of 27.7% and 62.3% in lip-sync accuracy (measured by SyncNet score) over prior diffusion-based methods, respectively, while retaining the high image fidelity characteristic of diffusion models.
📝 Abstract
Talking head synthesis, also known as speech-to-lip synthesis, reconstructs facial motions that align with a given audio track. Synthesized videos are evaluated mainly on two aspects: lip-speech synchronization and image fidelity. Recent studies demonstrate that GAN-based and diffusion-based models achieve state-of-the-art (SOTA) performance on this task, with diffusion-based models attaining superior image fidelity but weaker synchronization than their GAN-based counterparts. To this end, we propose SyncDiff, a simple yet effective approach that improves diffusion-based models by conditioning the diffusion process on a temporal pose frame with an information bottleneck and on facial-informative audio features extracted from AV-HuBERT. We evaluate SyncDiff on two canonical talking-head datasets, LRS2 and LRS3, for direct comparison with other SOTA models. Experiments show that SyncDiff achieves a synchronization score 27.7%/62.3% relatively higher than previous diffusion-based methods on LRS2/LRS3, while preserving their high-fidelity characteristics.
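The abstract describes the core idea, conditioning each denoising step on a pose frame and an audio embedding, but does not specify the network, noise schedule, or tensor shapes. The toy sketch below illustrates only that structure with a standard DDPM reverse step; `toy_denoiser`, the linear schedule, and all shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ddpm_step(x_t, t, eps_pred, alphas, alpha_bars, rng):
    """One standard DDPM reverse step given a predicted noise eps_pred."""
    a_t, ab_t = alphas[t], alpha_bars[t]
    beta_t = 1.0 - a_t
    mean = (x_t - beta_t / np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(a_t)
    if t > 0:  # no noise is added at the final step
        mean = mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)
    return mean

def toy_denoiser(x_t, pose, audio, t):
    # Hypothetical stand-in for the real network. The only point illustrated
    # is that the noise estimate is a joint function of the noisy frame, the
    # bottlenecked pose-frame condition, and the audio embedding.
    return 0.1 * x_t + 0.01 * pose + 0.01 * audio

# Assumed linear beta schedule with T = 50 steps.
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))       # noisy video frame (C, H, W)
pose = rng.standard_normal((3, 8, 8))    # temporal pose-frame condition
audio = np.array([0.5])                  # scalar stand-in for an AV-HuBERT embedding

# Conditional ancestral sampling: every step sees (pose, audio).
for t in reversed(range(T)):
    eps = toy_denoiser(x, pose, audio, t)
    x = ddpm_step(x, t, eps, alphas, alpha_bars, rng)
```

The design point the sketch captures is that the conditioning signals are fed into every denoising step rather than only at initialization, which is what lets the sampler trade off synchronization against fidelity throughout the reverse process.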