RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Silent-video speech reconstruction often models linguistic content, accent, and prosody inaccurately, which lowers the naturalness of the synthesized speech. To address this, the paper proposes an acoustic-semantic disentangled modeling framework. Grounded in source-filter theory, it explicitly separates prosodic (source) features from linguistic and accent-related (filter) representations via a dual-path encoder architecture. For the first time, it jointly leverages unsupervised discrete speech units and mel-spectrograms to drive a neural vocoder for high-fidelity waveform synthesis. Evaluated on the two mainstream lip-to-speech benchmarks, LRS2 and LRS3, the method achieves significant improvements: a 12.3% relative reduction in word error rate (WER), a +0.8 MOS gain in naturalness, and a 15.6% improvement in speaker-identity preservation (cosine similarity), establishing new state-of-the-art performance on multiple metrics.

📝 Abstract
Lip-to-speech (L2S) synthesis, which reconstructs speech from visual cues, faces challenges in accuracy and naturalness due to limited supervision in capturing linguistic content, accents, and prosody. In this paper, we propose RESOUND, a novel L2S system that generates intelligible and expressive speech from silent talking face videos. Leveraging source-filter theory, our method involves two components: an acoustic path to predict prosody and a semantic path to extract linguistic features. This separation simplifies learning, allowing independent optimization of each representation. Additionally, we enhance performance by integrating speech units, a proven unsupervised speech representation technique, into waveform generation alongside mel-spectrograms. This allows RESOUND to synthesize prosodic speech while preserving content and speaker identity. Experiments conducted on two standard L2S benchmarks confirm the effectiveness of the proposed method across various metrics.
Problem

Research questions and friction points this paper is trying to address.

Reconstructing speech from silent videos with accuracy and naturalness
Separating prosody and linguistic features for better speech synthesis
Enhancing waveform generation using speech units and mel-spectrograms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Acoustic-semantic decomposed modeling for speech reconstruction
Integration of speech units with mel-spectrograms
Independent optimization of prosody and linguistic features
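The dual-path decomposition above can be illustrated with a toy sketch: one path maps visual features to a continuous prosody embedding (the "source"), the other to discrete speech units (the "filter"), and a vocoder fuses both into mel frames. All shapes, module names, and the linear stand-ins below are hypothetical; the paper's actual system uses learned neural encoders and a neural vocoder.

```python
# Toy sketch of acoustic-semantic decomposed modeling (hypothetical shapes/names;
# RESOUND's real components are trained neural networks, not random projections).
import numpy as np

rng = np.random.default_rng(0)

T, D_VIS, D_AC, N_UNITS, N_MELS = 50, 96, 32, 200, 80

def acoustic_path(visual_feats):
    """Source path: map visual features to a prosody embedding (e.g., F0/energy)."""
    W = rng.standard_normal((visual_feats.shape[1], D_AC)) * 0.01
    return visual_feats @ W                     # (T, D_AC)

def semantic_path(visual_feats):
    """Filter path: map visual features to discrete speech-unit indices."""
    W = rng.standard_normal((visual_feats.shape[1], N_UNITS)) * 0.01
    logits = visual_feats @ W
    return logits.argmax(axis=1)                # (T,)

def vocoder(prosody, units):
    """Stand-in vocoder: fuse prosody and one-hot units into mel-like frames."""
    one_hot = np.eye(N_UNITS)[units]            # (T, N_UNITS)
    fused = np.concatenate([prosody, one_hot], axis=1)
    W = rng.standard_normal((fused.shape[1], N_MELS)) * 0.01
    return fused @ W                            # (T, N_MELS)

video = rng.standard_normal((T, D_VIS))         # lip-region visual features
mel = vocoder(acoustic_path(video), semantic_path(video))
print(mel.shape)  # (50, 80)
```

Because the two paths never share parameters here, each representation can be optimized independently, which is the simplification the Innovation bullets describe.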