Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited robustness of audio-visual automatic speech recognition (AV-ASR) in noisy environments by proposing a dual-path multimodal fusion strategy that synchronously integrates visual features into both the encoder and decoder of the Whisper architecture. This approach achieves, for the first time, deep integration of visual information throughout the entire Whisper pipeline, significantly enhancing cross-modal interaction and optimizing modality-specific weight allocation. Evaluated on the LRS3 benchmark under MUSAN noise conditions, the method achieves an average word error rate (WER) of 4.08%, setting a new state-of-the-art result. Furthermore, at 0 dB signal-to-noise ratio, it yields a relative WER reduction of 57% compared to conventional intermediate fusion approaches when applied to the Whisper-medium model.
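The dual-path idea described above can be illustrated with a minimal PyTorch sketch. This is a hypothetical reconstruction, not the paper's actual modules: cross-attention stands in for the encoder-side audiovisual interaction, and a learned sigmoid gate stands in for the decoder-side modality weighting; all class and parameter names here are invented for illustration.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Encoder-side fusion sketch: audio frames attend over visual frames.
    Hypothetical module; the paper's exact fusion layer may differ."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Audio features are queries; visual features supply keys/values.
        fused, _ = self.attn(audio, visual, visual)
        # Residual connection keeps the audio path intact when vision is uninformative.
        return self.norm(audio + fused)


class GatedVisualFusion(nn.Module):
    """Decoder-side fusion sketch: a per-dimension gate weighs the two
    modalities. Hypothetical stand-in for the paper's modality weighting."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, dec_state: torch.Tensor, visual_ctx: torch.Tensor) -> torch.Tensor:
        # Gate in (0, 1) decides how much each modality contributes per dimension.
        g = torch.sigmoid(self.gate(torch.cat([dec_state, visual_ctx], dim=-1)))
        return g * dec_state + (1 - g) * visual_ctx
```

In this sketch the encoder fusion lets audio and visual streams interact early, while the decoder gate can down-weight the audio path when noise corrupts it, matching the stated motivation of learning interactions in the encoder and weighing modalities in the decoder.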

📝 Abstract
In audiovisual automatic speech recognition (AV-ASR) systems, information fusion of visual features in a pre-trained ASR has been proven as a promising method to improve noise robustness. In this work, based on the prominent Whisper ASR, first, we propose a simple and effective visual fusion method -- use of visual features both in encoder and decoder (dual-use) -- to learn the audiovisual interactions in the encoder and to weigh modalities in the decoder. Second, we compare visual fusion methods in Whisper models of various sizes. Our proposed dual-use method shows consistent noise robustness improvement, e.g., a 35% relative improvement (WER: 4.41% vs. 6.83%) based on Whisper small, and a 57% relative improvement (WER: 4.07% vs. 9.53%) based on Whisper medium, compared to typical reference middle fusion in babble noise with a signal-to-noise ratio (SNR) of 0dB. Third, we conduct ablation studies examining the impact of various module designs and fusion options. Fine-tuned on 1929 hours of audiovisual data, our dual-use method using Whisper medium achieves 4.08% (MUSAN babble noise) and 4.43% (NoiseX babble noise) average WER across various SNRs, thereby establishing a new state-of-the-art in noisy conditions on the LRS3 AV-ASR benchmark. Our code is at https://github.com/ifnspaml/Dual-Use-AVASR
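The relative improvements quoted in the abstract follow directly from the WER pairs it reports. A small check, using only the numbers given above:

```python
def relative_wer_improvement(baseline: float, proposed: float) -> float:
    """Relative WER reduction in percent: how much of the baseline
    error rate the proposed method removes."""
    return 100.0 * (baseline - proposed) / baseline


# Whisper small, babble noise at 0 dB SNR (abstract: 4.41% vs. 6.83%)
print(round(relative_wer_improvement(6.83, 4.41)))  # ≈ 35

# Whisper medium, babble noise at 0 dB SNR (abstract: 4.07% vs. 9.53%)
print(round(relative_wer_improvement(9.53, 4.07)))  # ≈ 57
```

Both values reproduce the 35% and 57% relative improvements stated in the abstract.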
Problem

Research questions and friction points this paper is trying to address.

noise-robust
audiovisual ASR
visual fusion
Whisper
speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-use visual fusion
noise-robust AV-ASR
Whisper encoder-decoder
audiovisual interaction
multimodal weighting
Zhengyang Li
Thomas Graave
Technische Universität Braunschweig, Institute for Communications Technology, Schleinitzstr. 22, 38106 Braunschweig, Germany
Björn Möller
Zehang Wu
Technische Universität Braunschweig, Institute for Communications Technology, Schleinitzstr. 22, 38106 Braunschweig, Germany
Matthias Franz
Technische Universität Braunschweig, Institute for Communications Technology, Schleinitzstr. 22, 38106 Braunschweig, Germany
Tim Fingscheidt
Professor, IEEE Fellow, ITG Fellow, Technische Universität Braunschweig, Germany
Speech Enhancement, Acoustic Signal Processing, Speech Processing, Environment Perception, NLP