🤖 AI Summary
This work addresses end-to-end lip-to-speech synthesis: generating natural, intelligible speech directly from silent lip-motion video. We propose LipDiffuser, a conditional diffusion model that employs the magnitude-preserving ablated diffusion model (MP-ADM) architecture as its denoiser. To our knowledge, this is the first work to bring magnitude preservation to lip-to-speech modeling, promoting faithful reconstruction of spectral energy. We further introduce MP-FiLM, a magnitude-preserving feature-wise linear modulation mechanism that provides fine-grained, frame-level conditioning of mel-spectrogram generation on lip-motion features. Speaker embeddings supply voice identity, and a HiFi-GAN vocoder reconstructs high-fidelity waveforms from the generated mel-spectrograms. On the LRS3 and TCD-TIMIT benchmarks, our method surpasses existing lip-to-speech baselines in perceptual speech quality (MOS) and speaker similarity (cosine similarity) while remaining competitive in downstream ASR word error rate. Ablation studies, cross-dataset evaluation, and a formal listening test confirm both the effectiveness and the generalization of the approach across speakers and utterances.
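To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of a magnitude-preserving FiLM layer, assuming it follows the magnitude-preserving conventions of EDM2-style networks (bias-free, weight-normalized projections, a zero-initialized gain so modulation starts as the identity, and a magnitude-preserving sum). The names `MPLinear`, `mp_sum`, and `MPFiLM` are illustrative; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mp_sum(a, b, t=0.3):
    """Magnitude-preserving sum: blend two roughly unit-variance tensors and
    rescale so the mixture stays roughly unit-variance (assuming a and b are
    uncorrelated), following the EDM2 convention."""
    return ((1 - t) * a + t * b) / ((1 - t) ** 2 + t ** 2) ** 0.5


class MPLinear(nn.Module):
    """Bias-free linear layer whose weight rows are re-normalized to unit norm
    on every forward pass, so unit-variance inputs yield unit-variance outputs.
    (EDM2 additionally re-normalizes the stored weights after each optimizer
    step; that detail is omitted here for brevity.)"""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim))

    def forward(self, x):
        w = self.weight / self.weight.norm(dim=1, keepdim=True).clamp(min=1e-4)
        return F.linear(x, w)


class MPFiLM(nn.Module):
    """Hypothetical sketch of MP-FiLM: frame-aligned visual features predict a
    per-channel scale and shift; the scale path starts at identity via a
    zero-initialized gain, and the shift enters through a magnitude-preserving
    sum, keeping activation magnitudes controlled throughout."""

    def __init__(self, cond_dim, channels):
        super().__init__()
        self.to_scale = MPLinear(cond_dim, channels)
        self.to_shift = MPLinear(cond_dim, channels)
        self.gain = nn.Parameter(torch.zeros(()))  # modulation starts as identity

    def forward(self, x, cond):
        # x:    (batch, channels, frames) feature map inside the denoiser
        # cond: (batch, frames, cond_dim) lip-motion features aligned to mel frames
        scale = 1.0 + self.gain * self.to_scale(cond).transpose(1, 2)
        shift = self.to_shift(cond).transpose(1, 2)
        return mp_sum(x * scale, shift)
```

Compared with plain FiLM, the normalized projections and gated scale keep the modulated activations at a stable magnitude, which is the point of applying magnitude preservation to the conditioning path rather than only to the backbone.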
📝 Abstract
We present LipDiffuser, a conditional diffusion model for lip-to-speech generation that synthesizes natural and intelligible speech directly from silent video recordings. Our approach leverages the magnitude-preserving ablated diffusion model (MP-ADM) architecture as its denoiser. To condition the model effectively, we incorporate visual features via magnitude-preserving feature-wise linear modulation (MP-FiLM) alongside speaker embeddings. A neural vocoder then reconstructs the speech waveform from the generated mel-spectrograms. Evaluations on LRS3 and TCD-TIMIT demonstrate that LipDiffuser outperforms existing lip-to-speech baselines in perceptual speech quality and speaker similarity, while remaining competitive in downstream automatic speech recognition (ASR). These findings are further supported by a formal listening experiment, and extensive ablation studies and cross-dataset evaluation confirm the effectiveness and generalization capabilities of our approach.
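For orientation, here is a hedged sketch of how the components described above could fit together at inference time: visual features and a speaker embedding condition an iterative denoiser that samples a mel-spectrogram from noise, and a neural vocoder (e.g., a HiFi-GAN generator) converts it to a waveform. The Euler probability-flow sampler shown is a standard choice for EDM-style diffusion models, not necessarily the paper's exact sampler, and all module names are placeholders.

```python
import math
import torch


@torch.no_grad()
def lip_to_speech(video, visual_encoder, spk_embedding, denoiser, vocoder, sigmas):
    """Illustrative inference pipeline: silent video -> mel-spectrogram -> waveform.
    `sigmas` is a decreasing noise schedule ending at 0; all modules are
    placeholders for the components named in the abstract."""
    cond = visual_encoder(video)                         # (1, frames, cond_dim)
    mel = torch.randn(1, 80, cond.shape[1]) * sigmas[0]  # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Denoiser predicts the clean mel given the current noisy estimate,
        # conditioned on lip features and the speaker embedding.
        denoised = denoiser(mel, sigma, cond, spk_embedding)
        d = (mel - denoised) / sigma                     # probability-flow ODE slope
        mel = mel + (sigma_next - sigma) * d             # Euler step to next noise level
    return vocoder(mel)                                  # (1, 1, samples) waveform


# Example schedule: geometrically spaced noise levels with a final step to 0,
# so the last Euler update lands exactly on the denoiser's clean estimate.
sigmas = torch.cat([
    torch.exp(torch.linspace(math.log(80.0), math.log(1e-3), 49)),
    torch.zeros(1),
])
```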