mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual speech recognition (AVSR) methods are predominantly English-centric and degrade substantially in multilingual and noisy environments, largely because high-quality multilingual audio-visual data is scarce, which limits model generalization. To address this, we propose mWhisper-Flamingo, an AVSR framework tailored for multilingual, noisy scenarios. Our method combines the pre-trained Whisper audio encoder with the AV-HuBERT visual encoder through a cross-modal decoder, and introduces decoder modality dropout, a training strategy in which the model sees both paired audio-visual samples and unimodal audio or video samples, enabling better multi-modal integration. We further adopt multilingual joint fine-tuning and noise-robust training. Evaluated on the 9-language MuAViC benchmark, our approach achieves state-of-the-art word error rate (WER), and across diverse noise conditions the audio-visual model consistently outperforms the audio-only Whisper baseline, demonstrating significant improvements in both multilingual capability and noise robustness.

📝 Abstract
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR, which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve noisy multilingual performance, we introduce decoder modality dropout, where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
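The decoder modality dropout idea described above can be sketched in a few lines: during training, each sample is presented either with both modalities or with one modality dropped, so the decoder learns to transcribe from any input combination. This is a minimal illustrative sketch, not the paper's implementation; the function name, dropout probabilities, and feature placeholders are all assumptions.

```python
import random

def apply_modality_dropout(audio_feats, video_feats,
                           p_audio_only=0.25, p_video_only=0.25,
                           rng=random):
    """Hypothetical sketch of decoder modality dropout.

    Returns an (audio, video) pair where one modality may be dropped
    (replaced by None), so the model trains on paired audio-visual
    samples as well as unimodal audio-only / video-only samples.
    """
    r = rng.random()
    if r < p_audio_only:
        return audio_feats, None            # audio-only training sample
    if r < p_audio_only + p_video_only:
        return None, video_feats            # video-only training sample
    return audio_feats, video_feats         # paired audio-visual sample
```

In an actual training loop the dropped modality would typically be replaced by zeroed or masked features rather than `None`, and the probabilities tuned so that paired inputs still dominate.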
Problem

Research questions and friction points this paper is trying to address.

Multilingual AVSR
Noisy Environment
Data Scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

AVSR
mWhisper-Flamingo
Multimodal Fusion