Leveraging Large Language Models in Visual Speech Recognition: Model Scaling, Context-Aware Decoding, and Iterative Polishing

📅 2025-05-27
🤖 AI Summary
This paper addresses the underexploited potential of large language models (LLMs) in visual speech recognition (VSR). We propose a three-stage LLM-cooperative paradigm: (1) empirically uncovering the LLM scaling law for VSR tasks; (2) designing a context-guided decoding mechanism that dynamically injects both visual and linguistic contextual cues; and (3) establishing a multi-round semantic iterative refinement framework to enable cross-modal alignment fine-tuning and self-correcting decoding. Our approach overcomes the decoding bottlenecks inherent in conventional end-to-end VSR systems. Evaluated on standard benchmarks—including LRW and LRS3—our method achieves a 12.6% relative reduction in word error rate over the state of the art. These results robustly demonstrate the efficacy and generalizability of LLMs in leveraging strong linguistic priors for purely vision-based lip reading, even without audio input.
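The context-guided decoding stage can be illustrated with a minimal sketch. The prompt template, the `build_context_prompt` helper, and the `llm_generate` callable below are hypothetical illustrations, not the paper's actual implementation; the idea is only that preceding text gives the LLM linguistic cues for correcting the visually decoded hypothesis:

```python
def build_context_prompt(context: str, hypothesis: str) -> str:
    # Hypothetical prompt template (not the paper's exact format):
    # the LLM sees preceding utterances plus the raw lip-reading
    # output, and is asked to emit a corrected transcription.
    return (
        "Preceding utterances:\n"
        f"{context}\n\n"
        "Lip-reading hypothesis:\n"
        f"{hypothesis}\n\n"
        "Corrected transcription:"
    )


def context_aware_decode(llm_generate, context: str, hypothesis: str) -> str:
    # llm_generate: any callable mapping a prompt string to generated
    # text, e.g. a thin wrapper around an LLM decoding API.
    return llm_generate(build_context_prompt(context, hypothesis)).strip()
```

In practice the contextual text could be preceding dialogue turns or earlier segments of the same video; the sketch leaves that choice open.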

📝 Abstract
Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to effectively utilize them in VSR tasks remains underexplored. This paper systematically explores how to better leverage LLMs for VSR and makes three key contributions: (1) Scaling Test: we study how LLM size affects VSR performance, confirming a scaling law for the task. (2) Context-Aware Decoding: we add contextual text to guide LLM decoding, improving recognition accuracy. (3) Iterative Polishing: we iteratively refine LLM outputs, progressively reducing recognition errors. Extensive experiments demonstrate that these designs largely harness the potential of LLMs, yielding significant VSR performance improvements.
Problem

Research questions and friction points this paper is trying to address.

Exploring LLM size impact on Visual Speech Recognition performance
Improving VSR accuracy with context-aware LLM decoding
Reducing recognition errors via iterative LLM output polishing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling LLM size for VSR performance
Context-aware decoding with text guidance
Iterative polishing to reduce errors
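The iterative polishing idea above can be sketched as a simple fixed-point loop. The `refine` callable and the early-stopping rule are assumptions for illustration, not the paper's method; `refine` stands in for one LLM round that corrects the current transcription:

```python
def iterative_polish(hypothesis: str, refine, max_rounds: int = 3) -> str:
    # refine: callable asking the LLM to correct the current
    # transcription; a hypothetical stand-in for one round of the
    # multi-round semantic refinement stage.
    current = hypothesis
    for _ in range(max_rounds):
        refined = refine(current)
        if refined == current:  # stop early once the output is stable
            break
        current = refined
    return current
```

Bounding the number of rounds and stopping at a fixed point keeps the cost of repeated LLM calls predictable while still allowing multi-step error reduction.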
Zehua Liu
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
Xiaolou Li
Beijing University of Posts and Telecommunications
Speech Processing · Deep Learning
Li Guo
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
Lantian Li
Associate Professor @ Beijing University of Posts and Telecommunications
Speech Information Processing · Deep Learning
Dong Wang
Center for Speech and Language Technologies, Tsinghua University, China