AI Summary
To address the high computational cost and low efficiency of existing LLM-based audio-visual speech recognition (AVSR) systems under high temporal resolution, this paper proposes a lightweight multimodal speech LLM framework. It employs early audio-visual fusion, a dynamic duration-aware audio-visual joint Q-Former, and a speech-rate-prediction-driven adaptive query allocation mechanism to significantly compress input token length. The resulting architecture achieves a token processing rate of 3.5 tokens/s, the first multimodal speech LLM to operate below 4 tokens/s, while attaining a state-of-the-art 0.74% word error rate (WER) on LRS3. Compared to prior multimodal speech LLMs, it reduces token count by 86% and FLOPs by 35.7%, striking a new Pareto-optimal balance between accuracy and efficiency. Key innovations include speech-rate-aware dynamic token allocation, a cross-modal joint Q-Former design, and an end-to-end low-token-rate large language model architecture.
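A rough sketch may help make the duration-aware Q-Former idea concrete. The PyTorch module below is an illustrative reconstruction, not the paper's code: it shows how a pool of learnable queries can be trimmed per clip using the clip duration and a small speech-rate regressor, so that longer or faster speech receives more tokens. All layer sizes, the `rate_predictor` head, and the clamping scheme are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class DurationAwareQFormer(nn.Module):
    """Minimal sketch of a duration-aware audio-visual Q-Former.

    The number of queries used for a clip is chosen from the clip
    duration and a predicted relative speech rate, so the LLM sees a
    token stream whose length tracks linguistic content rather than
    raw frame count. Sizes are illustrative, not the paper's config.
    """

    def __init__(self, d_model=512, n_heads=8, n_layers=2,
                 max_queries=64, base_tokens_per_sec=3.5):
        super().__init__()
        self.base_tokens_per_sec = base_tokens_per_sec
        # Pool of learnable query embeddings; a prefix is selected per clip.
        self.queries = nn.Parameter(torch.randn(max_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        # Tiny regressor predicting a positive speech-rate scale from
        # mean-pooled fused features (an assumed, hypothetical head).
        self.rate_predictor = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Softplus(),
        )

    def forward(self, av_feats, duration_sec):
        # av_feats: (B, T, d_model) fused audio-visual features for one clip.
        rate_scale = self.rate_predictor(av_feats.mean(dim=1))  # (B, 1)
        # Faster speech -> more queries; one shared count per batch here
        # for simplicity, clamped to the available query pool.
        n_queries = int(torch.clamp(
            rate_scale.mean() * self.base_tokens_per_sec * duration_sec,
            min=1, max=self.queries.shape[0]).item())
        q = self.queries[:n_queries].unsqueeze(0).expand(av_feats.size(0), -1, -1)
        # Cross-attend the selected queries to the fused AV features.
        return self.decoder(q, av_feats)  # (B, n_queries, d_model)

if __name__ == "__main__":
    qformer = DurationAwareQFormer()
    feats = torch.randn(1, 150, 512)          # e.g. 6 s of fused features at 25 Hz
    tokens = qformer(feats, duration_sec=6.0)
    print(tokens.shape)                       # (1, n, 512); n depends on predicted rate
```

The output sequence, not the raw 25 Hz feature stream, is what would be handed to the LLM, which is where the token-rate savings come from.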
Abstract
Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of the audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early audio-visual fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor that adjusts token allocation according to the speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.74% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also cuts FLOPs by 35.7%.
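As a back-of-envelope check on how the two headline numbers relate, the 86% token reduction together with the 3.5 tokens/s figure implies the prior framework's token rate; the snippet below derives it. Reading the result as roughly one token per frame of 25 fps video is our interpretation, not a claim made in the abstract.

```python
# Implied token rate of the prior framework, derived from the two reported figures.
ours_tps = 3.5                          # tokens per second in this work
reduction = 0.86                        # reported token-count reduction
baseline_tps = ours_tps / (1 - reduction)
print(f"implied baseline: {baseline_tps:.1f} tokens/s")  # 25.0 tokens/s
```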