Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited robustness of audio-visual automatic speech recognition (AV-ASR) in noisy environments by proposing a dual-path multimodal fusion strategy that synchronously integrates visual features into both the encoder and decoder of the Whisper architecture. This approach achieves, for the first time, deep integration of visual information throughout the entire Whisper pipeline, significantly enhancing cross-modal interaction and optimizing modality-specific weight allocation. Evaluated on the LRS3 benchmark under MUSAN noise conditions, the method achieves an average word error rate (WER) of 4.08%, setting a new state-of-the-art result. Furthermore, at 0 dB signal-to-noise ratio, it yields a relative WER reduction of 57% compared to conventional intermediate fusion approaches when applied to the Whisper-medium model.
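The dual-path idea described above can be illustrated with a minimal PyTorch sketch. This is a hypothetical reconstruction, not the paper's actual modules: cross-attention stands in for the encoder-side audiovisual interaction, and a learned sigmoid gate stands in for the decoder-side modality weighting; all class and parameter names here are invented for illustration.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Encoder-side fusion sketch: audio frames attend over visual frames.
    Hypothetical module; the paper's exact fusion layer may differ."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Audio features are queries; visual features supply keys/values.
        fused, _ = self.attn(audio, visual, visual)
        # Residual connection keeps the audio path intact when vision is uninformative.
        return self.norm(audio + fused)


class GatedVisualFusion(nn.Module):
    """Decoder-side fusion sketch: a per-dimension gate weighs the two
    modalities. Hypothetical stand-in for the paper's modality weighting."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, dec_state: torch.Tensor, visual_ctx: torch.Tensor) -> torch.Tensor:
        # Gate in (0, 1) decides how much each modality contributes per dimension.
        g = torch.sigmoid(self.gate(torch.cat([dec_state, visual_ctx], dim=-1)))
        return g * dec_state + (1 - g) * visual_ctx
```

In this sketch the encoder fusion lets audio and visual streams interact early, while the decoder gate can down-weight the audio path when noise corrupts it, matching the stated motivation of learning interactions in the encoder and weighing modalities in the decoder.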

📝 Abstract
In audiovisual automatic speech recognition (AV-ASR) systems, information fusion of visual features in a pre-trained ASR has been proven as a promising method to improve noise robustness. In this work, based on the prominent Whisper ASR, first, we propose a simple and effective visual fusion method -- use of visual features both in encoder and decoder (dual-use) -- to learn the audiovisual interactions in the encoder and to weigh modalities in the decoder. Second, we compare visual fusion methods in Whisper models of various sizes. Our proposed dual-use method shows consistent noise robustness improvement, e.g., a 35% relative improvement (WER: 4.41% vs. 6.83%) based on Whisper small, and a 57% relative improvement (WER: 4.07% vs. 9.53%) based on Whisper medium, compared to typical reference middle fusion in babble noise with a signal-to-noise ratio (SNR) of 0dB. Third, we conduct ablation studies examining the impact of various module designs and fusion options. Fine-tuned on 1929 hours of audiovisual data, our dual-use method using Whisper medium achieves 4.08% (MUSAN babble noise) and 4.43% (NoiseX babble noise) average WER across various SNRs, thereby establishing a new state-of-the-art in noisy conditions on the LRS3 AV-ASR benchmark. Our code is at https://github.com/ifnspaml/Dual-Use-AVASR
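The relative improvements quoted in the abstract follow directly from the WER pairs it reports. A small check, using only the numbers given above:

```python
def relative_wer_improvement(baseline: float, proposed: float) -> float:
    """Relative WER reduction in percent: how much of the baseline
    error rate the proposed method removes."""
    return 100.0 * (baseline - proposed) / baseline


# Whisper small, babble noise at 0 dB SNR (abstract: 4.41% vs. 6.83%)
print(round(relative_wer_improvement(6.83, 4.41)))  # ≈ 35

# Whisper medium, babble noise at 0 dB SNR (abstract: 4.07% vs. 9.53%)
print(round(relative_wer_improvement(9.53, 4.07)))  # ≈ 57
```

Both values reproduce the 35% and 57% relative improvements stated in the abstract.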
Problem

Research questions and friction points this paper is trying to address.

noise-robust
audiovisual ASR
visual fusion
Whisper
speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-use visual fusion
noise-robust AV-ASR
Whisper encoder-decoder
audiovisual interaction
multimodal weighting
Zhengyang Li
Thomas Graave
Technische Universität Braunschweig, Institute for Communications Technology, Schleinitzstr. 22, 38106 Braunschweig, Germany
Björn Möller
Zehang Wu
Technische Universität Braunschweig, Institute for Communications Technology, Schleinitzstr. 22, 38106 Braunschweig, Germany
Matthias Franz
Technische Universität Braunschweig, Institute for Communications Technology, Schleinitzstr. 22, 38106 Braunschweig, Germany
Tim Fingscheidt
Professor, IEEE Fellow, ITG Fellow, Technische Universität Braunschweig, Germany
Speech Enhancement, Acoustic Signal Processing, Speech Processing, Environment Perception, NLP