🤖 AI Summary
This study investigates the functional role of large language model (LLM) decoders in visual speech recognition (VSR). Addressing the central question—whether performance gains stem from language modeling or from visual understanding—we systematically evaluate freezing versus selective fine-tuning of visual encoders, scale LLM decoders up to Llama-2-13B, compare adapter-based and architectural adaptation strategies, and validate across LRS2, LRS3, and WildVSR. Results demonstrate that LLMs primarily enhance contextual reasoning rather than visual feature learning; the current bottleneck lies in the limited representational capacity of visual encoders. Consequently, we identify strengthening the visual encoder as the critical optimization pathway. Without additional supervision, our approach achieves 24.7% WER on LRS3 and 47.0% WER on WildVSR—setting new state-of-the-art results at the time and significantly improving cross-dataset generalization.
📝 Abstract
Advances in self-supervised encoders have improved Visual Speech Recognition (VSR). Recent approaches that integrate these encoders with LLM decoders improve transcription accuracy; however, it remains unclear whether these gains stem from visual understanding or from stronger language modeling. In this work, we systematically evaluate LLM decoders by freezing or selectively updating the visual encoder, scaling decoder size, comparing adaptation strategies and architectures, and varying training data across LRS2, LRS3, and their combination. Evaluation on LRS2, LRS3, and WildVSR shows that scaling and adaptation yield limited improvements, while combining datasets enhances generalization. Semantic analysis reveals that gains arise primarily from lexical rather than semantic processing. Our Llama-2-13B model trained on the combined set achieves 24.7% WER on LRS3 and 47.0% on WildVSR, establishing state-of-the-art results among models trained without additional supervision. Our findings indicate that LLM decoders refine contextual reasoning rather than visual features, emphasizing the need for stronger visual encoders to drive meaningful progress.
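The WER figures reported above follow the standard definition: word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal sketch of this standard metric (not the authors' evaluation code) for readers unfamiliar with it:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# A perfect transcript scores 0.0; one substituted word out of
# three reference words scores 1/3 ≈ 0.333 (33.3% WER).
print(wer("the cat sat", "the bat sat"))
```

A WER of 24.7% on LRS3 thus means roughly one in four reference words requires a substitution, insertion, or deletion to match the model's transcript.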