VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the visual text-to-speech (V-TTS) task, aiming to synthesize high-fidelity speech that is speaker-cloneable, semantically faithful to input text, and strictly synchronized with lip movements in the input video. To this end, we propose VSpeechLM, a vision-speech language model. First, a text-video alignment module establishes phoneme-level lip-motion–speech correspondence, generating an extended phoneme sequence enriched with temporal synchronization cues. Second, a multimodal decoder—built upon a speech large language model—fuses cross-modal information via joint text-video embeddings and phoneme-level alignment. Evaluated on benchmarks including LRS3, our method achieves state-of-the-art performance across all three key metrics: speech quality (MOS), speaker similarity (SIM), and lip-sync error (LSE). Notably, it is the first approach to simultaneously achieve high naturalness, high-fidelity voice cloning, and frame-accurate lip synchronization.

📝 Abstract
The task of Visual Text-to-Speech (VisualTTS), also known as video dubbing, aims to generate speech that is synchronized with the lip movements in an input video, in addition to being consistent with the content of the input text and cloning the timbre of a reference speech. Existing VisualTTS models typically adopt lightweight architectures and design specialized modules to achieve each of these goals, yet the resulting speech quality is unsatisfactory due to limited model capacity and the scarcity of VisualTTS data. Recently, speech large language models (SpeechLLMs) have shown a robust ability to generate high-quality speech, but little work has been done to leverage temporal cues from video input when generating lip-synchronized speech. To generate speech that is both high-quality and lip-synchronized in VisualTTS tasks, we propose a novel Visual Speech Language Model, called VSpeechLM, built upon a SpeechLLM. To capture the synchronization relationship between text and video, we propose a text-video aligner: it first learns fine-grained alignment between phonemes and lip movements, and then outputs an expanded phoneme sequence containing lip-synchronization cues. Our proposed SpeechLLM-based decoders then take the expanded phoneme sequence as input and learn to generate lip-synchronized speech. Extensive experiments demonstrate that VSpeechLM significantly outperforms previous VisualTTS methods in terms of overall quality, speaker similarity, and synchronization metrics.
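The core idea of the "expanded phoneme sequence" can be illustrated with a minimal sketch. The function below is hypothetical (the paper does not publish this code): it assumes the text-video aligner has already produced a per-phoneme duration, measured in video frames, from the lip-movement alignment, and simply repeats each phoneme accordingly so the sequence carries frame-level timing cues.

```python
def expand_phonemes(phonemes, durations_in_frames):
    """Repeat each phoneme by its aligned duration (in video frames).

    `durations_in_frames` stands in for the output of the paper's
    text-video aligner; here it is just a list of integers.
    """
    assert len(phonemes) == len(durations_in_frames)
    expanded = []
    for ph, dur in zip(phonemes, durations_in_frames):
        expanded.extend([ph] * dur)
    return expanded

# Example: two phonemes whose lip movements span 2 and 3 frames
print(expand_phonemes(["HH", "AY"], [2, 3]))
# -> ['HH', 'HH', 'AY', 'AY', 'AY']
```

The resulting frame-aligned sequence is what a downstream decoder could consume to keep generated speech in step with the video, as the abstract describes.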
Problem

Research questions and friction points this paper is trying to address.

Generating lip-synchronized speech from text and video
Improving speech quality in visual text-to-speech synthesis
Aligning phonemes with lip movements for synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-video aligner for phoneme-lip synchronization
SpeechLLM decoders generate lip-synchronized speech
Expanded phoneme sequence with visual cues
Yuyue Wang
Gaoling School of Artificial Intelligence, Renmin University of China
Xin Cheng
Gaoling School of Artificial Intelligence, Renmin University of China
Yihan Wu
Gaoling School of Artificial Intelligence, Renmin University of China
Xihua Wang
Renmin University of China
Jinchuan Tian
Language Technologies Institute, Carnegie Mellon University
Ruihua Song
Renmin University of China