🤖 AI Summary
Existing multimodal-LLM research on speech focuses predominantly on the audio modality, while the noise-invariant visual (lip-movement) modality has received little attention. To address this, we propose Llama-AVSR, an end-to-end audio-visual speech recognition framework built on a frozen large language model (Llama3.1-8B). The method keeps the pre-trained audio and video encoders and the LLM backbone frozen, fine-tuning only lightweight modality projectors and LoRA adapters, and applies modality-aware compression to balance performance and efficiency. Evaluated on the LRS3 benchmark, Llama-AVSR achieves state-of-the-art results with only a small fraction of parameters trained: 0.79% WER for audio-only ASR and 0.77% WER for AVSR. These results demonstrate that LLM-driven, parameter-efficient AVSR is both feasible and highly effective.
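To make the parameter-efficient recipe concrete, below is a minimal PyTorch sketch of the training setup described above: the pre-trained weights stay frozen and only a low-rank (LoRA) update and a modality projector receive gradients. The hand-rolled `LoRALinear` class, the layer dimensions, and the single-linear projector are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze the pre-trained weight
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # the low-rank update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Small stand-ins for the frozen components (the real model uses an
# 8B-parameter LLM and large audio/video encoders); only the LoRA
# matrices and the projector remain trainable.
llm_layer = LoRALinear(nn.Linear(4096, 4096))  # e.g., one attention projection
audio_projector = nn.Linear(1024, 4096)        # hypothetical encoder-dim -> LLM-dim

modules = (llm_layer, audio_projector)
trainable = sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)
total = sum(p.numel() for m in modules for p in m.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")
```

Applied across a frozen 8B-parameter backbone, this pattern is what keeps the trainable footprint to a small fraction of the full model.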
📝 Abstract
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. For example, in the audio and speech domains, an LLM can be equipped with automatic speech recognition (ASR) capabilities simply by concatenating audio tokens, computed with an audio encoder, with the text tokens, achieving state-of-the-art results. In contrast, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip-movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. It leverages pre-trained audio and video encoders to produce modality-specific tokens which, together with the text tokens, are processed by a pre-trained LLM (e.g., Llama3.1-8B) to yield the response in an auto-regressive fashion. Llama-AVSR requires a small number of trainable parameters, as only modality-specific projectors and LoRA modules are trained, whereas the multimodal encoders and the LLM are kept frozen. We evaluate our approach on LRS3, the largest public AVSR benchmark, and achieve new state-of-the-art results on the ASR and AVSR tasks with WERs of 0.79% and 0.77%, respectively. To bolster our results, we investigate the key factors that underpin the effectiveness of Llama-AVSR: the choice of the pre-trained encoders and LLM, the efficient integration of LoRA modules, and the optimal performance-efficiency trade-off obtained via modality-aware compression rates.
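The sketch below illustrates the token pipeline the abstract describes: audio and video tokens are compressed at modality-specific rates, projected into the LLM embedding space, and concatenated with the text tokens before the frozen LLM decodes auto-regressively. The stacking-based compression, all dimensions, the single-linear projectors, and the [video | audio | text] ordering are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

def compress(tokens: torch.Tensor, rate: int) -> torch.Tensor:
    """Stack `rate` consecutive tokens along the feature dim (assumed scheme)."""
    b, t, d = tokens.shape
    t = (t // rate) * rate                      # drop any trailing remainder
    return tokens[:, :t].reshape(b, t // rate, d * rate)

class AVSRFrontend(nn.Module):
    """Trainable projectors mapping compressed modality tokens to LLM inputs."""
    def __init__(self, d_audio=1024, d_video=1024, d_llm=4096,
                 rate_audio=3, rate_video=2):
        super().__init__()
        self.rate_audio, self.rate_video = rate_audio, rate_video
        self.proj_audio = nn.Linear(d_audio * rate_audio, d_llm)
        self.proj_video = nn.Linear(d_video * rate_video, d_llm)

    def forward(self, audio_tokens, video_tokens, text_embeds):
        a = self.proj_audio(compress(audio_tokens, self.rate_audio))
        v = self.proj_video(compress(video_tokens, self.rate_video))
        # Multimodal prefix followed by text tokens, fed to the frozen LLM.
        return torch.cat([v, a, text_embeds], dim=1)

# Usage with dummy encoder outputs standing in for the frozen encoders.
frontend = AVSRFrontend()
audio = torch.randn(1, 300, 1024)   # e.g., Whisper-style audio features
video = torch.randn(1, 150, 1024)   # e.g., AV-HuBERT-style video features
text = torch.randn(1, 20, 4096)     # embedded prompt/text tokens
llm_input = frontend(audio, video, text)
print(llm_input.shape)              # torch.Size([1, 195, 4096])
```

Raising a compression rate shortens the sequence the LLM must attend over, which is where the performance-efficiency trade-off mentioned in the abstract comes from.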