SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes the first end-to-end lip-to-speech framework built on a hierarchical subspace latent diffusion model, directly mapping visual lip movements to the continuous latent space of a pretrained neural audio codec. By circumventing intermediate representations such as mel-spectrograms or self-supervised learning (SSL) tokens, which often incur information loss and hinder high-fidelity reconstruction, the method uses subspace decomposition together with diffusion convolution blocks (DiCB) to enhance multi-scale feature interaction within and across subspaces. It further employs reparameterized flow matching, which allows speech language model (SLM) and semantic losses to be incorporated directly during training. Evaluated on multiple benchmark datasets, the approach achieves state-of-the-art performance in both objective metrics and subjective listening tests, markedly improving the naturalness and fidelity of the generated speech without relying on conventional intermediate representations.
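The subspace decomposition described above can be illustrated with a minimal sketch: a feature map is split along the channel dimension into parallel subspaces, each processed independently, then recombined. This is an assumption-laden toy in NumPy (the paper processes each subspace with DiCB blocks, not the stand-in scaling used here); the function names are hypothetical.

```python
import numpy as np

def subspace_decompose(features, num_subspaces):
    """Split a (time, channels) feature map into parallel channel subspaces."""
    T, C = features.shape
    assert C % num_subspaces == 0, "channel count must divide evenly"
    # Each subspace holds a contiguous slice of the channel dimension.
    return np.split(features, num_subspaces, axis=1)

def subspace_recombine(subspaces):
    """Concatenate processed subspaces back into one feature map."""
    return np.concatenate(subspaces, axis=1)

# Toy visual feature sequence: 25 frames, 64 channels.
rng = np.random.default_rng(0)
visual = rng.standard_normal((25, 64))

parts = subspace_decompose(visual, num_subspaces=4)
# Stand-in per-subspace transform (the paper uses DiCB blocks here).
processed = [p * (i + 1) for i, p in enumerate(parts)]
out = subspace_recombine(processed)
print(out.shape)  # (25, 64)
```

The split-process-recombine round trip preserves the feature-map shape, so the hierarchy can be stacked without any reshaping between stages.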

📝 Abstract
Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion models (LDMs) in this task remains largely unexplored. In this paper, we introduce SLD-L2S, a novel L2S framework built upon a hierarchical subspace latent diffusion model. Our method aims to directly map visual lip movements to the continuous latent space of a pre-trained neural audio codec, thereby avoiding the information loss inherent in traditional intermediate representations. The core of our method is a hierarchical architecture that processes visual representations through multiple parallel subspaces, initiated by a subspace decomposition module. To efficiently enhance interactions within and between these subspaces, we design the diffusion convolution block (DiCB) as our network backbone. Furthermore, we employ a reparameterized flow matching technique to directly generate the target latent vectors. This enables a principled inclusion of speech language model (SLM) and semantic losses during training, moving beyond conventional flow matching objectives and improving synthesized speech quality. Our experiments show that SLD-L2S achieves state-of-the-art generation quality on multiple benchmark datasets, surpassing existing methods in both objective and subjective evaluations.
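The reparameterized flow matching mentioned in the abstract can be sketched in a few lines: a straight-line path between noise and the target codec latent defines a constant velocity target, and the reparameterization recovers a clean-latent estimate from any predicted velocity, which is what makes it possible to attach SLM and semantic losses at every training step. This NumPy sketch assumes the standard linear (rectified-flow-style) path; the function names are illustrative, not from the paper.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear interpolation path and its constant velocity target.

    x0: noise sample, x1: target codec latent, t in [0, 1].
    """
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight-line path
    v_target = x1 - x0              # velocity the network must regress
    return x_t, v_target

def reparameterized_prediction(x_t, v_pred, t):
    """Recover an estimate of the clean latent x1 from a predicted velocity.

    Auxiliary losses (e.g. SLM or semantic losses) can then be applied to
    this latent estimate rather than to the raw velocity.
    """
    return x_t + (1.0 - t) * v_pred

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)  # noise
x1 = rng.standard_normal(8)  # target codec latent
t = 0.3

x_t, v = flow_matching_pair(x0, x1, t)
# With the exact velocity, the reparameterization recovers x1 exactly.
x1_hat = reparameterized_prediction(x_t, v, t)
print(np.allclose(x1_hat, x1))  # True
```

Algebraically, x_t + (1 - t)(x1 - x0) = (1 - t)x0 + t·x1 + (1 - t)x1 - (1 - t)x0 = x1, so the recovery is exact when the velocity prediction is exact; during training, the quality of the latent estimate tracks the quality of the velocity prediction.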
Problem

Research questions and friction points this paper addresses.

lip-to-speech synthesis
latent diffusion models
intermediate representations
information loss
audio generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent diffusion model
hierarchical subspace
lip-to-speech synthesis
diffusion convolution block
reparameterized flow matching
Yifan Liang
Huazhong University of Science and Technology
Computer Vision · Machine Learning
Andong Li
Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Kang Yang
School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
Guochen Yu
Zhipu AI
Fangkun Liu
Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Lingling Dai
Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Xiaodong Li
Institute of Acoustics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Chengshi Zheng
Institute of Acoustics, Chinese Academy of Sciences
Speech enhancement · microphone array · deep learning