🤖 AI Summary
This work proposes an end-to-end, mask-free joint framework for audio-visual speech recognition in high-noise environments, where conventional mask-based speech enhancement often discards semantic information along with the noise and thereby impairs recognition. Instead of generating explicit noise masks, the approach uses visual cues to implicitly guide the purification of noisy audio features before modality fusion, preserving semantic integrity and strengthening cross-modal interaction. A Conformer-based bottleneck fusion module reduces modality redundancy and enables efficient implicit audio-visual integration. On the LRS3 benchmark under noisy conditions, the method outperforms prior advanced mask-based baselines, demonstrating its effectiveness for robust multimodal speech recognition.
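To make the mask-free idea concrete, below is a minimal PyTorch sketch contrasting explicit mask-based filtering with visual-guided implicit refinement. The module names, dimensions, and the cross-attention formulation are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: explicit masking vs. mask-free implicit refinement.
# All names (MaskBasedEnhancer, ImplicitVisualRefiner, dim, ...) are
# illustrative assumptions, not taken from the paper's code.
import torch
import torch.nn as nn

class MaskBasedEnhancer(nn.Module):
    """Conventional strategy: predict an explicit ratio mask and multiply it
    onto the noisy features. Whatever the mask suppresses is gone, so speech
    content can be erased together with the noise."""
    def __init__(self, dim=256):
        super().__init__()
        self.mask_net = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, noisy_audio):            # (B, T, D)
        mask = self.mask_net(noisy_audio)      # values in (0, 1)
        return noisy_audio * mask              # masked-out information is lost

class ImplicitVisualRefiner(nn.Module):
    """Mask-free alternative: cross-attention lets clean visual cues
    additively correct the noisy audio features; the residual path keeps
    the original audio semantics intact instead of hard-zeroing them."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, noisy_audio, video):     # (B, Ta, D), (B, Tv, D)
        ctx, _ = self.attn(noisy_audio, video, video)
        return self.norm(noisy_audio + ctx)    # refine, never hard-filter
```

The key design point is the residual path: rather than zeroing out time-frequency regions, the visual context re-weights and corrects the audio stream, so semantic content is never irreversibly masked away.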
📝 Abstract
Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio input tends to introduce adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies that filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information along with the noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement that eliminates the need for explicit noise mask generation. The framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features under visual guidance. By reducing modality redundancy and enhancing inter-modal interaction, our method preserves the semantic integrity of speech and achieves robust recognition performance. Experiments on the public LRS3 benchmark show that our method outperforms prior advanced mask-based baselines under noisy conditions.
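As a rough illustration of how a Conformer-based bottleneck fusion module could be wired up, the sketch below routes all cross-modal exchange through a few shared bottleneck tokens, in the spirit of attention-bottleneck fusion. The class name, hyperparameters, and the bottleneck-averaging step are assumptions for illustration, and it uses torchaudio's stock Conformer rather than the authors' implementation.

```python
# Hedged sketch of bottleneck fusion built from Conformer blocks.
# Assumes PyTorch + torchaudio; all names and sizes are illustrative.
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class BottleneckFusion(nn.Module):
    """Fuse audio and video token sequences through a small set of shared
    bottleneck tokens, so cross-modal information must pass through a
    low-capacity channel (one way to limit modality redundancy)."""
    def __init__(self, dim=256, n_bottleneck=4, n_layers=4, n_heads=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim) * 0.02)
        # One single-layer Conformer block per fusion step and per modality.
        self.audio_blocks = nn.ModuleList(
            Conformer(input_dim=dim, num_heads=n_heads, ffn_dim=4 * dim,
                      num_layers=1, depthwise_conv_kernel_size=31)
            for _ in range(n_layers))
        self.video_blocks = nn.ModuleList(
            Conformer(input_dim=dim, num_heads=n_heads, ffn_dim=4 * dim,
                      num_layers=1, depthwise_conv_kernel_size=31)
            for _ in range(n_layers))

    def forward(self, audio, video):
        # audio: (B, Ta, D) noisy audio features; video: (B, Tv, D) lip features
        B = audio.size(0)
        btk = self.bottleneck.expand(B, -1, -1)
        for a_blk, v_blk in zip(self.audio_blocks, self.video_blocks):
            # Each modality attends over its own tokens plus the bottleneck.
            a_in = torch.cat([audio, btk], dim=1)
            v_in = torch.cat([video, btk], dim=1)
            a_len = torch.full((B,), a_in.size(1), device=audio.device)
            v_len = torch.full((B,), v_in.size(1), device=audio.device)
            a_out, _ = a_blk(a_in, a_len)
            v_out, _ = v_blk(v_in, v_len)
            audio, btk_a = a_out[:, :audio.size(1)], a_out[:, audio.size(1):]
            video, btk_v = v_out[:, :video.size(1)], v_out[:, video.size(1):]
            # Average the two bottleneck views so the next layer sees a shared
            # cross-modal summary; visual context implicitly refines the noisy
            # audio stream without any explicit noise mask.
            btk = 0.5 * (btk_a + btk_v)
        return audio, video, btk

# Example usage with dummy features (2 clips, 120 audio / 30 video frames):
fusion = BottleneckFusion(dim=256)
a, v = torch.randn(2, 120, 256), torch.randn(2, 30, 256)
audio_out, video_out, summary = fusion(a, v)
```

Because every cross-modal bit must pass through the few bottleneck tokens, fusion capacity is explicitly limited, which is one plausible way to operationalize the abstract's claim of reduced modality redundancy with efficient implicit integration.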