LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency

📅 2026-02-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of generating natural and speaker-consistent speech from silent facial videos, a task often hindered by inconsistent prosody. The authors propose a novel diffusion-based approach that, for the first time, jointly leverages three complementary cues from facial images: speaker identity, linguistic content derived from lip movements, and emotional context. By integrating these multimodal signals into a unified feature fusion framework with explicit prosody guidance, the method enhances prosodic consistency in the synthesized speech. Experimental results demonstrate that the proposed approach significantly outperforms existing methods across key metrics, including global and local pitch deviation, energy consistency, and speaker similarity, thereby achieving marked improvements in both naturalness and speaker identity fidelity.

Technology Category

Application Category

📝 Abstract
Lip-to-speech synthesis aims to generate speech audio directly from silent facial video by reconstructing linguistic content from lip movements, providing valuable applications in situations where audio signals are unavailable or degraded. While recent diffusion-based models such as LipVoicer have demonstrated impressive performance in reconstructing linguistic content, they often lack prosodic consistency. In this work, we propose LipSody, a lip-to-speech framework enhanced for prosody consistency. LipSody introduces a prosody-guiding strategy that leverages three complementary cues: speaker identity extracted from facial images, linguistic content derived from lip movements, and emotional context inferred from face video. Experimental results demonstrate that LipSody substantially improves prosody-related metrics, including global and local pitch deviations, energy consistency, and speaker similarity, compared to prior approaches.
Problem

Research questions and friction points this paper is trying to address.

lip-to-speech synthesis
prosody consistency
speech generation
facial video
audio reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

lip-to-speech synthesis
prosody consistency
diffusion-based model
speaker identity
emotional context
🔎 Similar Papers
No similar papers found.