🤖 AI Summary
Existing lip-to-speech approaches rely on mel-spectrograms as intermediate representations, inducing a domain mismatch between synthesized spectrograms and the real spectrograms used to train vocoders, which limits speech quality. This paper proposes NaturalL2S, an end-to-end, differentiable lip-to-speech synthesis framework that bypasses mel-spectrogram intermediaries and generates high-fidelity speech directly from video. The key contributions are: (1) joint modeling of F0 contour prediction with a differentiable DDSP-based speech synthesizer; (2) implicit multi-speaker representation learning without explicit speaker embeddings; and (3) end-to-end joint optimization of the full model. Experiments demonstrate consistent gains over state-of-the-art methods on objective metrics, including Mel Cepstral Distortion (MCD) and Short-Time Objective Intelligibility (STOI), as well as on subjective Mean Opinion Score (MOS), with significant improvements in naturalness, intelligibility, and speaker similarity.
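To make the objective evaluation concrete, below is a minimal sketch of how frame-averaged MCD is commonly computed between time-aligned mel-cepstral sequences; the function name and the assumption that frames are already aligned (e.g., via DTW) are illustrative choices, not details from the paper. STOI can be computed analogously with the `pystoi` package.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """Frame-averaged MCD in dB between time-aligned mel-cepstra.

    Uses the common formulation with constant 10*sqrt(2)/ln(10) (~6.14)
    and excludes the 0th (energy) coefficient. Frames are assumed to be
    already aligned, e.g., via dynamic time warping.

    mcep_ref, mcep_syn: (T, D) arrays of mel-cepstral coefficients.
    """
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]  # drop the energy term c0
    const = 10.0 * np.sqrt(2.0) / np.log(10.0)
    # Per-frame Euclidean distance over coefficients, then average.
    return const * np.mean(np.sqrt(np.sum(diff ** 2, axis=1)))
```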
📝 Abstract
Recent advances in visual speech recognition (VSR) have driven progress in lip-to-speech synthesis, where pre-trained VSR models enhance the intelligibility of synthesized speech by providing valuable semantic information. The success of cascaded frameworks, which combine pseudo-VSR with pseudo-text-to-speech (TTS) or implicitly utilize the transcribed text, highlights the benefit of leveraging VSR models. However, these methods typically rely on mel-spectrograms as an intermediate representation, which introduces a key bottleneck: the domain gap between synthetic mel-spectrograms, generated from inherently error-prone lip-to-speech mappings, and the real mel-spectrograms used to train vocoders. This mismatch degrades synthesis quality. To bridge this gap, we propose Natural Lip-to-Speech (NaturalL2S), an end-to-end framework that integrates acoustic inductive biases with differentiable speech generation components. Specifically, we introduce a fundamental frequency (F0) predictor to capture prosodic variations in the synthesized speech. The predicted F0 then drives a Differentiable Digital Signal Processing (DDSP) synthesizer to generate a coarse signal that serves as prior information for subsequent speech synthesis. In addition, instead of relying on a reference speaker embedding as an auxiliary input, our approach achieves satisfactory speaker similarity without explicitly modeling speaker characteristics. Both objective and subjective evaluations demonstrate that NaturalL2S effectively enhances the quality of synthesized speech compared with state-of-the-art methods. Our demonstration page is accessible at https://yifan-liang.github.io/NaturalL2S/.
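To illustrate how a predicted F0 contour can drive a DDSP-style synthesizer to produce a coarse signal, here is a minimal harmonic-plus-noise sketch in the spirit of DDSP. The hop size, function name, and control inputs are assumptions for illustration; the paper's actual synthesizer is a learned, fully differentiable module optimized jointly with the rest of the network.

```python
import numpy as np

def ddsp_harmonic_synth(f0, harmonic_amps, noise_mag, sr=16000, hop=256):
    """Render a coarse waveform from frame-rate controls (illustrative).

    f0:            (T,) frame-level fundamental frequency in Hz
    harmonic_amps: (T, K) per-frame amplitudes for K harmonics
    noise_mag:     (T,) per-frame gain of the noise branch
    sr, hop:       sample rate and samples per frame (assumed values)
    """
    n = len(f0) * hop
    # Upsample frame-rate controls to sample rate by linear interpolation.
    t_frame = np.arange(len(f0)) * hop
    t_samp = np.arange(n)
    f0_up = np.interp(t_samp, t_frame, f0)
    # Instantaneous phase: integrate frequency over time.
    phase = 2.0 * np.pi * np.cumsum(f0_up / sr)
    audio = np.zeros(n)
    for k in range(harmonic_amps.shape[1]):
        amp_up = np.interp(t_samp, t_frame, harmonic_amps[:, k])
        # Anti-alias: silence any harmonic that exceeds Nyquist.
        amp_up = np.where((k + 1) * f0_up < sr / 2.0, amp_up, 0.0)
        audio += amp_up * np.sin((k + 1) * phase)
    # Noise branch supplies unvoiced / aperiodic energy.
    noise_up = np.interp(t_samp, t_frame, noise_mag)
    audio += noise_up * np.random.randn(n)
    return audio
```

Because every operation above (interpolation, cumulative sum, sinusoids) is differentiable, gradients from a waveform-level loss can flow back through the synthesizer into the F0 predictor, which is what makes end-to-end joint optimization of this kind of pipeline possible.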