🤖 AI Summary
Problem: Automatic speech recognition (ASR) models degrade under noisy conditions, and audio-visual speech recognition (AVSR) models that depend on lip movements fail when the speaker's lips are occluded or not visible at all.
Method: This paper proposes a noise-aware, disentangled visual modeling framework grounded in generalized scene-level visual information. Unlike prior work that relies on lip motion, it exploits non-lip visual cues (such as background scene, illumination, and motion) to explicitly model noise sources, without requiring a frontal view of the speaker. Pretrained speech and visual encoders are linked by an audio-visual multi-head attention bridge for end-to-end joint prediction of transcriptions and noise labels. A scalable audio-visual pairing pipeline builds datasets in which the visual cues correlate strongly with the noise in the audio.
Results: On multi-noise benchmarks, our method reduces word error rate (WER) by 23.6% relative to audio-only ASR, demonstrating that scene-level visual priors play a critical role in speech enhancement and noise-robust recognition.
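As a reading aid for the WER figure above, here is a minimal sketch of how word error rate and a relative WER reduction are usually computed. The function names and the toy sentences are illustrative assumptions and are not taken from the paper; the paper's own scoring setup (text normalization, tokenization) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Toy example (made-up transcripts, not results from the paper).
reference = "please turn off the kitchen lights"
audio_only = "please turn of the chicken light"      # 3 of 6 words wrong
audio_visual = "please turn off the kitchen light"   # 1 of 6 words wrong

baseline = word_error_rate(reference, audio_only)
improved = word_error_rate(reference, audio_visual)
print(f"baseline WER = {baseline:.3f}, improved WER = {improved:.3f}")
print(f"relative reduction = {relative_wer_reduction(baseline, improved):.1%}")
```

Under this definition, a 23.6% relative reduction means the new WER is 76.4% of the audio-only WER, not that WER dropped by 23.6 percentage points.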
📝 Abstract
Humans can use visual cues, such as lip movements and visual scenes, to enhance auditory perception, particularly in noisy environments. However, current Automatic Speech Recognition (ASR) and Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To address this, we propose a model that improves transcription by correlating noise sources with visual cues. Unlike prior work that relies on lip motion and requires the speaker to be visible, we exploit broader visual information from the environment. This allows our model to naturally filter speech from noise and improve transcription, much like humans do in noisy scenarios. Our method repurposes pretrained speech and visual encoders, linking them with multi-headed attention. This approach enables the transcription of speech and the prediction of noise labels in video inputs. We introduce a scalable pipeline to develop audio-visual datasets in which visual cues correlate with the noise in the audio. We show significant improvements over existing audio-only models in noisy scenarios. Results also highlight that visual cues play a vital role in improving transcription accuracy.
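The idea of linking the two encoders with multi-headed attention can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the abstract does not say which modality supplies the queries, and a real model would use learned query, key, value, and output projections, which are omitted (treated as identity) to keep the example short. The name `multi_head_cross_attention` is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are frames, columns are feature dimensions


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(queries[0]))
    width = len(values[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        output.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)]
        )
    return output


def multi_head_cross_attention(audio: Matrix, visual: Matrix, num_heads: int) -> Matrix:
    """Each audio frame attends over all visual frames (assumed direction).

    Simplification: no learned projections; each head just sees its own
    slice of the feature dimension.
    """
    dim = len(audio[0])
    if len(visual[0]) != dim or dim % num_heads != 0:
        raise ValueError("feature sizes must match and be divisible by num_heads")
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        audio_slice = [row[lo:hi] for row in audio]
        visual_slice = [row[lo:hi] for row in visual]
        heads.append(attention_head(audio_slice, visual_slice, visual_slice))
    # Concatenate the head outputs back into one vector per audio frame.
    return [[x for head in heads for x in head[t]] for t in range(len(audio))]


# Toy features standing in for encoder outputs (3 audio frames, 2 visual frames).
audio_feats = [[0.1, 0.2, 0.3, 0.4], [1.0, 0.0, -1.0, 0.5], [0.0, 0.0, 0.0, 0.0]]
visual_feats = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 0.5, 2.0]]
context = multi_head_cross_attention(audio_feats, visual_feats, num_heads=2)
print(len(context), len(context[0]))
```

In a full model, the visual context produced this way would be combined with the audio features before the transcription and noise-label heads; how the paper does that is not specified in the abstract.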