🤖 AI Summary
Existing audio-visual speech recognition (AVSR) methods are predominantly English-centric and suffer substantial performance degradation in multilingual and noisy environments, primarily because high-quality multilingual audio-visual data is scarce, which impairs model generalization. To address this, we propose a novel AVSR framework tailored to multilingual, noisy scenarios. Our method introduces a decoder modality dropout mechanism that enables joint training on both audio-video paired and unimodal samples; combines the Whisper audio encoder with the AV-HuBERT visual encoder for the first time; and designs a cross-modal decoder. We further adopt multilingual joint fine-tuning and noise-robust training strategies. Evaluated on the 9-language MuAViC benchmark, our approach achieves state-of-the-art word error rate (WER). Across diverse noise conditions, our audio-visual model consistently outperforms the audio-only Whisper baseline, demonstrating significant improvements in both multilingual capability and noise robustness.
📝 Abstract
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR, which combines the strengths of a pre-trained audio model (Whisper) and a pre-trained video model (AV-HuBERT). To enable better multi-modal integration and improve noisy multilingual performance, we introduce decoder modality dropout, where the model is trained both on paired audio-visual inputs and on separate audio or visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset covering 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
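The core idea of decoder modality dropout, training sometimes on paired audio-visual inputs and sometimes on a single modality, can be sketched as a per-step sampling routine. This is a minimal illustration, not the paper's implementation; the function name `sample_modalities` and the sampling probabilities are assumptions chosen for clarity.

```python
import random

def sample_modalities(rng, p_both=0.5, p_audio=0.25, p_video=0.25):
    """Decide which modalities the decoder attends to this training step.

    Hypothetical sketch of decoder modality dropout: with probability
    p_both the decoder sees paired audio-visual features; otherwise it
    sees audio-only or video-only features. The probabilities here are
    illustrative, not the paper's actual schedule.
    """
    r = rng.random()
    if r < p_both:
        return ("audio", "video")
    elif r < p_both + p_audio:
        return ("audio",)
    return ("video",)

# Simulate many training steps to show all three modes occur.
rng = random.Random(0)
counts = {("audio", "video"): 0, ("audio",): 0, ("video",): 0}
for _ in range(10000):
    counts[sample_modalities(rng)] += 1
```

Training on all three modes forces the decoder to stay useful when one modality is missing or degraded, which is what yields the robustness to noisy audio reported on MuAViC.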