🤖 AI Summary
This work addresses the limited robustness of audio-visual automatic speech recognition (AV-ASR) in noisy environments by proposing a dual-path multimodal fusion strategy that integrates visual features into both the encoder and the decoder of the Whisper architecture. This enables deep integration of visual information throughout the entire Whisper pipeline, strengthening cross-modal interaction in the encoder and modality weighting in the decoder. Evaluated on the LRS3 benchmark under MUSAN babble noise, the method achieves an average word error rate (WER) of 4.08% across signal-to-noise ratios, setting a new state-of-the-art result. At a 0 dB signal-to-noise ratio, it yields a 57% relative WER reduction over conventional intermediate (middle) fusion when applied to the Whisper-medium model.
📝 Abstract
In audiovisual automatic speech recognition (AV-ASR) systems, fusing visual features into a pre-trained ASR model has proven to be a promising way to improve noise robustness. In this work, based on the prominent Whisper ASR, first, we propose a simple and effective visual fusion method -- using visual features in both the encoder and the decoder (dual-use) -- to learn audiovisual interactions in the encoder and to weigh modalities in the decoder. Second, we compare visual fusion methods across Whisper models of various sizes. Our proposed dual-use method shows consistent noise robustness improvement, e.g., a 35% relative improvement (WER: 4.41% vs. 6.83%) based on Whisper small, and a 57% relative improvement (WER: 4.07% vs. 9.53%) based on Whisper medium, compared to a typical middle-fusion reference in babble noise at a signal-to-noise ratio (SNR) of 0 dB. Third, we conduct ablation studies examining the impact of various module designs and fusion options. Fine-tuned on 1929 hours of audiovisual data, our dual-use method using Whisper medium achieves 4.08% (MUSAN babble noise) and 4.43% (NoiseX babble noise) average WER across various SNRs, thereby establishing a new state of the art in noisy conditions on the LRS3 AV-ASR benchmark. Our code is at https://github.com/ifnspaml/Dual-Use-AVASR
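The dual-use idea -- letting both encoder and decoder states attend to visual features -- can be illustrated with a minimal cross-attention sketch. This is a hypothetical toy implementation, not the authors' actual module design (see the linked repository for that); the class name `VisualFusionBlock`, the layer sizes, and the choice of residual cross-attention are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualFusionBlock(nn.Module):
    """Toy cross-attention fusion: ASR hidden states attend to visual features.

    Illustrative sketch only; the paper's actual fusion modules may differ.
    """
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, D) encoder or decoder states; visual: (B, Tv, D) lip features
        fused, _ = self.attn(query=hidden, key=visual, value=visual)
        # Residual connection keeps the original (audio/text) path intact
        return self.norm(hidden + fused)

# "Dual-use": the same kind of fusion block appears in both encoder and decoder.
enc_fusion = VisualFusionBlock(d_model=512)
dec_fusion = VisualFusionBlock(d_model=512)

audio_states = torch.randn(2, 100, 512)  # toy Whisper encoder states
visual_feats = torch.randn(2, 50, 512)   # toy projected visual (lip) features
dec_states = torch.randn(2, 20, 512)     # toy decoder hidden states

enc_out = enc_fusion(audio_states, visual_feats)  # fusion inside the encoder
dec_out = dec_fusion(dec_states, visual_feats)    # fusion inside the decoder
print(enc_out.shape, dec_out.shape)
```

Intuitively, the encoder-side block learns audiovisual interactions on the acoustic representation, while the decoder-side block lets token generation re-weigh the modalities, matching the abstract's stated motivation for dual use.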