AI Summary
To address the high computational cost and low efficiency of existing LLM-based audio-visual speech recognition (AVSR) systems under high temporal resolution, this paper proposes a lightweight multimodal speech LLM framework. It employs early audio-visual fusion, a dynamic duration-aware audio-visual joint Q-Former, and a speech-rate-prediction-driven adaptive query allocation mechanism to significantly compress input token length. The resulting architecture achieves a token processing rate of 3.5 tokens/s, the first multimodal speech LLM to operate below 4 tokens/s, while attaining a state-of-the-art 0.74% word error rate (WER) on LRS3. Compared to prior multimodal speech LLMs, it reduces token count by 86% and FLOPs by 35.7%, striking a new Pareto-optimal balance between accuracy and efficiency. Key innovations include speech-rate-aware dynamic token allocation, a cross-modal joint Q-Former design, and an end-to-end low-token-rate large language model architecture.
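A rough sketch may help make the duration-aware Q-Former idea concrete. The PyTorch module below is an illustrative reconstruction, not the paper's code: it shows how a pool of learnable queries can be trimmed per clip using the clip duration and a small speech-rate regressor, so that longer or faster speech receives more tokens. All layer sizes, the `rate_predictor` head, and the clamping scheme are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class DurationAwareQFormer(nn.Module):
    """Minimal sketch of a duration-aware audio-visual Q-Former.

    The number of queries used for a clip is chosen from the clip
    duration and a predicted relative speech rate, so the LLM sees a
    token stream whose length tracks linguistic content rather than
    raw frame count. Sizes are illustrative, not the paper's config.
    """

    def __init__(self, d_model=512, n_heads=8, n_layers=2,
                 max_queries=64, base_tokens_per_sec=3.5):
        super().__init__()
        self.base_tokens_per_sec = base_tokens_per_sec
        # Pool of learnable query embeddings; a prefix is selected per clip.
        self.queries = nn.Parameter(torch.randn(max_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        # Tiny regressor predicting a positive speech-rate scale from
        # mean-pooled fused features (an assumed, hypothetical head).
        self.rate_predictor = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Softplus(),
        )

    def forward(self, av_feats, duration_sec):
        # av_feats: (B, T, d_model) fused audio-visual features for one clip.
        rate_scale = self.rate_predictor(av_feats.mean(dim=1))  # (B, 1)
        # Faster speech -> more queries; one shared count per batch here
        # for simplicity, clamped to the available query pool.
        n_queries = int(torch.clamp(
            rate_scale.mean() * self.base_tokens_per_sec * duration_sec,
            min=1, max=self.queries.shape[0]).item())
        q = self.queries[:n_queries].unsqueeze(0).expand(av_feats.size(0), -1, -1)
        # Cross-attend the selected queries to the fused AV features.
        return self.decoder(q, av_feats)  # (B, n_queries, d_model)

if __name__ == "__main__":
    qformer = DurationAwareQFormer()
    feats = torch.randn(1, 150, 512)          # e.g. 6 s of fused features at 25 Hz
    tokens = qformer(feats, duration_sec=6.0)
    print(tokens.shape)                       # (1, n, 512); n depends on predicted rate
```

The output sequence, not the raw 25 Hz feature stream, is what would be handed to the LLM, which is where the token-rate savings come from.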
Abstract
Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of the audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early audio-visual fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor that adjusts token allocation according to the speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.74% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also cuts FLOPs by 35.7%.
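As a back-of-envelope check on how the two headline numbers relate, the 86% token reduction together with the 3.5 tokens/s figure implies the prior framework's token rate; the snippet below derives it. Reading the result as roughly one token per frame of 25 fps video is our interpretation, not a claim made in the abstract.

```python
# Implied token rate of the prior framework, derived from the two reported figures.
ours_tps = 3.5                          # tokens per second in this work
reduction = 0.86                        # reported token-count reduction
baseline_tps = ours_tps / (1 - reduction)
print(f"implied baseline: {baseline_tps:.1f} tokens/s")  # 25.0 tokens/s
```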