LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning

๐Ÿ“… 2025-09-29
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Mandarin lip-to-speech (L2S) synthesis faces two key challenges: complex visemeโ€“phoneme mapping and the critical influence of lexical tone on intelligibility. To address these, we propose a tone-aware cross-lingual transfer generative model. First, we leverage English pre-trained audiovisual self-supervised models (e.g., Wav2Vec 2.0) for cross-lingual knowledge transfer to mitigate the scarcity of paired Mandarin lip-video data. Second, we incorporate discrete speech units derived from ASR fine-tuning as strong linguistic priors to explicitly guide fundamental frequency (F0) contour modeling, enabling accurate tone synthesis. Third, we integrate flow matching with a two-stage training paradigm to enhance speech naturalness. Experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches, achieving a 28.3% reduction in word error rate (WER), a 19.7% improvement in tone accuracy, and a 0.65-point gain in Mean Opinion Score (MOS) for prosodic naturalness.

Technology Category

Application Category

๐Ÿ“ Abstract
Lip-to-speech (L2S) synthesis for Mandarin is a significant challenge, hindered by complex viseme-to-phoneme mappings and the critical role of lexical tones in intelligibility. To address this issue, we propose Lexical Tone-Aware Lip-to-Speech (LTA-L2S). To tackle viseme-to-phoneme complexity, our model adapts an English pre-trained audio-visual self-supervised learning (SSL) model via a cross-lingual transfer learning strategy. This strategy not only transfers universal knowledge learned from extensive English data to the Mandarin domain but also circumvents the prohibitive cost of training such a model from scratch. To specifically model lexical tones and enhance intelligibility, we further employ a flow-matching model to generate the F0 contour. This generation process is guided by ASR-fine-tuned SSL speech units, which contain crucial suprasegmental information. The overall speech quality is then elevated through a two-stage training paradigm, where a flow-matching postnet refines the coarse spectrogram from the first stage. Extensive experiments demonstrate that LTA-L2S significantly outperforms existing methods in both speech intelligibility and tonal accuracy.
Problem

Research questions and friction points this paper is trying to address.

Addresses Mandarin lip-to-speech synthesis challenges
Overcomes viseme-phoneme mapping complexity via cross-lingual transfer
Enhances lexical tone intelligibility using F0 contour modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual transfer learning from English pre-trained model
Flow-matching model generates F0 contour for lexical tones
Two-stage training paradigm refines spectrogram with postnet
๐Ÿ”Ž Similar Papers
No similar papers found.
K
Kang Yang
Jiangsu Key University Laboratory of Software and Media Technology under Human-Computer Cooperation, Jiangnan University, Wuxi, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
Yifan Liang
Yifan Liang
Huazhong University of Science and Technology
Computer VisionMachine Learning
F
Fangkun Liu
Institute of Acoustics Chinese Academy of Science, Beijing, China
Z
Zhenping Xie
Jiangsu Key University Laboratory of Software and Media Technology under Human-Computer Cooperation, Jiangnan University, Wuxi, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
Chengshi Zheng
Chengshi Zheng
Institute of Acoustics, Chinese Academy of Sciences
Speech enhancementmicrophone arraydeep learning