🤖 AI Summary
Existing audio-visual speech recognition (AVSR) methods are predominantly English-centric and suffer substantial performance degradation in multilingual and noisy environments, primarily because high-quality multilingual audio-visual data is scarce, which impairs model generalization. To address this, we propose a novel AVSR framework tailored to multilingual, noisy scenarios. Our method introduces a decoder modality dropout mechanism that enables joint training on both audio-video paired and unimodal samples; combines the Whisper audio encoder with the AV-HuBERT visual encoder for the first time; and designs a cross-modal decoder. We further adopt multilingual joint fine-tuning and noise-robust training strategies. Evaluated on the 9-language MuAViC benchmark, our approach achieves state-of-the-art word error rate (WER). Across diverse noise conditions, our audio-visual model consistently outperforms the audio-only Whisper baseline, demonstrating significant improvements in both multilingual capability and noise robustness.
📝 Abstract
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR, which combines the strengths of a pre-trained audio model (Whisper) and a pre-trained video model (AV-HuBERT). To enable better multi-modal integration and improve noisy multilingual performance, we introduce decoder modality dropout, where the model is trained both on paired audio-visual inputs and on separate audio or visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset covering 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
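The core idea of decoder modality dropout, training sometimes on paired audio-visual inputs and sometimes on a single modality, can be sketched as a per-step sampling routine. This is a minimal illustration, not the paper's implementation; the function name `sample_modalities` and the sampling probabilities are assumptions chosen for clarity.

```python
import random

def sample_modalities(rng, p_both=0.5, p_audio=0.25, p_video=0.25):
    """Decide which modalities the decoder attends to this training step.

    Hypothetical sketch of decoder modality dropout: with probability
    p_both the decoder sees paired audio-visual features; otherwise it
    sees audio-only or video-only features. The probabilities here are
    illustrative, not the paper's actual schedule.
    """
    r = rng.random()
    if r < p_both:
        return ("audio", "video")
    elif r < p_both + p_audio:
        return ("audio",)
    return ("video",)

# Simulate many training steps to show all three modes occur.
rng = random.Random(0)
counts = {("audio", "video"): 0, ("audio",): 0, ("video",): 0}
for _ in range(10000):
    counts[sample_modalities(rng)] += 1
```

Training on all three modes forces the decoder to stay useful when one modality is missing or degraded, which is what yields the robustness to noisy audio reported on MuAViC.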