🤖 AI Summary
This paper addresses the underexploited potential of large language models (LLMs) in visual speech recognition (VSR). It proposes a three-stage LLM-cooperative paradigm: (1) empirically uncovering an LLM scaling law for VSR; (2) designing a context-guided decoding mechanism that dynamically injects both visual and linguistic contextual cues; and (3) establishing a multi-round iterative semantic refinement framework that enables cross-modal alignment fine-tuning and self-correcting decoding. The approach overcomes decoding bottlenecks inherent in conventional end-to-end VSR systems. Evaluated on standard benchmarks, including LRW and LRS3, the method achieves a 12.6% relative reduction in word error rate over the state of the art. These results demonstrate that the strong linguistic priors of LLMs generalize to purely vision-based lip reading, with no audio input required.
📝 Abstract
Visual Speech Recognition (VSR) transcribes speech by analyzing lip movements. Recently, Large Language Models (LLMs) have been integrated into VSR systems, leading to notable performance improvements. However, the potential of LLMs has not been extensively studied, and how to utilize them effectively in VSR remains underexplored. This paper systematically explores how to better leverage LLMs for VSR and makes three key contributions: (1) Scaling Test: we study how LLM size affects VSR performance, confirming a scaling law for the task. (2) Context-Aware Decoding: we add contextual text to guide LLM decoding, improving recognition accuracy. (3) Iterative Polishing: we iteratively refine LLM outputs, progressively reducing recognition errors. Extensive experiments demonstrate that these designs allow the potential of LLMs to be effectively harnessed, yielding significant VSR performance improvements.
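The two decoding-side ideas in the abstract, contextual guidance and iterative polishing, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `llm` is a toy stand-in for a real language-model call, and the function names and prompt format are invented for the example.

```python
def llm(prompt: str) -> str:
    """Toy stand-in for an LLM call.

    A real system would invoke a decoder LLM here; this placeholder just
    returns the last line of the prompt with one spelling error "corrected",
    so the refinement loop below has something observable to do.
    """
    hypothesis = prompt.strip().splitlines()[-1]
    return hypothesis.replace("wether", "whether")

def context_aware_decode(visual_hypothesis: str, context: str) -> str:
    # Context-aware decoding: inject contextual text into the prompt
    # ahead of the lip-reading hypothesis to guide the LLM's output.
    prompt = (
        f"Context: {context}\n"
        f"Correct this lip-read transcript:\n"
        f"{visual_hypothesis}"
    )
    return llm(prompt)

def iterative_polish(hypothesis: str, context: str, rounds: int = 3) -> str:
    # Iterative polishing: feed the LLM's own output back in,
    # stopping early once the hypothesis stops changing.
    for _ in range(rounds):
        refined = context_aware_decode(hypothesis, context)
        if refined == hypothesis:  # converged, no further edits
            break
        hypothesis = refined
    return hypothesis

print(iterative_polish("i wonder wether it works", context="a casual chat"))
# → i wonder whether it works
```

In a real pipeline the hypothesis would come from the visual front-end, and the stopping criterion might compare model confidence rather than exact string equality.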