🤖 AI Summary
This work proposes an end-to-end, mask-free joint framework for audio-visual speech recognition in high-noise environments, where conventional mask-based speech enhancement often discards semantic information along with the noise and thereby impairs recognition. Instead of generating explicit noise masks, the approach uses visual cues to implicitly guide the purification of noisy audio features before modality fusion, preserving semantic integrity and strengthening cross-modal interaction. A Conformer-based bottleneck fusion module reduces modality redundancy and enables efficient implicit audio-visual integration. On the LRS3 benchmark under noisy conditions, the method outperforms prior advanced mask-based baselines, demonstrating its effectiveness for robust multimodal speech recognition.
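To make the mask-free idea concrete, below is a minimal PyTorch sketch contrasting explicit mask-based filtering with visual-guided implicit refinement. The module names, dimensions, and the cross-attention formulation are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: explicit masking vs. mask-free implicit refinement.
# All names (MaskBasedEnhancer, ImplicitVisualRefiner, dim, ...) are
# illustrative assumptions, not taken from the paper's code.
import torch
import torch.nn as nn

class MaskBasedEnhancer(nn.Module):
    """Conventional strategy: predict an explicit ratio mask and multiply it
    onto the noisy features. Whatever the mask suppresses is gone, so speech
    content can be erased together with the noise."""
    def __init__(self, dim=256):
        super().__init__()
        self.mask_net = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, noisy_audio):            # (B, T, D)
        mask = self.mask_net(noisy_audio)      # values in (0, 1)
        return noisy_audio * mask              # masked-out information is lost

class ImplicitVisualRefiner(nn.Module):
    """Mask-free alternative: cross-attention lets clean visual cues
    additively correct the noisy audio features; the residual path keeps
    the original audio semantics intact instead of hard-zeroing them."""
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, noisy_audio, video):     # (B, Ta, D), (B, Tv, D)
        ctx, _ = self.attn(noisy_audio, video, video)
        return self.norm(noisy_audio + ctx)    # refine, never hard-filter
```

The key design point is the residual path: rather than zeroing out time-frequency regions, the visual context re-weights and corrects the audio stream, so semantic content is never irreversibly masked away.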
📝 Abstract
Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio input tends to introduce adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies that filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information along with the noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement that eliminates the need for explicit noise mask generation. The framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features under visual guidance. By reducing modality redundancy and enhancing inter-modal interaction, our method preserves the semantic integrity of speech and achieves robust recognition performance. Experiments on the public LRS3 benchmark show that our method outperforms prior advanced mask-based baselines under noisy conditions.
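As a rough illustration of how a Conformer-based bottleneck fusion module could be wired up, the sketch below routes all cross-modal exchange through a few shared bottleneck tokens, in the spirit of attention-bottleneck fusion. The class name, hyperparameters, and the bottleneck-averaging step are assumptions for illustration, and it uses torchaudio's stock Conformer rather than the authors' implementation.

```python
# Hedged sketch of bottleneck fusion built from Conformer blocks.
# Assumes PyTorch + torchaudio; all names and sizes are illustrative.
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class BottleneckFusion(nn.Module):
    """Fuse audio and video token sequences through a small set of shared
    bottleneck tokens, so cross-modal information must pass through a
    low-capacity channel (one way to limit modality redundancy)."""
    def __init__(self, dim=256, n_bottleneck=4, n_layers=4, n_heads=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim) * 0.02)
        # One single-layer Conformer block per fusion step and per modality.
        self.audio_blocks = nn.ModuleList(
            Conformer(input_dim=dim, num_heads=n_heads, ffn_dim=4 * dim,
                      num_layers=1, depthwise_conv_kernel_size=31)
            for _ in range(n_layers))
        self.video_blocks = nn.ModuleList(
            Conformer(input_dim=dim, num_heads=n_heads, ffn_dim=4 * dim,
                      num_layers=1, depthwise_conv_kernel_size=31)
            for _ in range(n_layers))

    def forward(self, audio, video):
        # audio: (B, Ta, D) noisy audio features; video: (B, Tv, D) lip features
        B = audio.size(0)
        btk = self.bottleneck.expand(B, -1, -1)
        for a_blk, v_blk in zip(self.audio_blocks, self.video_blocks):
            # Each modality attends over its own tokens plus the bottleneck.
            a_in = torch.cat([audio, btk], dim=1)
            v_in = torch.cat([video, btk], dim=1)
            a_len = torch.full((B,), a_in.size(1), device=audio.device)
            v_len = torch.full((B,), v_in.size(1), device=audio.device)
            a_out, _ = a_blk(a_in, a_len)
            v_out, _ = v_blk(v_in, v_len)
            audio, btk_a = a_out[:, :audio.size(1)], a_out[:, audio.size(1):]
            video, btk_v = v_out[:, :video.size(1)], v_out[:, video.size(1):]
            # Average the two bottleneck views so the next layer sees a shared
            # cross-modal summary; visual context implicitly refines the noisy
            # audio stream without any explicit noise mask.
            btk = 0.5 * (btk_a + btk_v)
        return audio, video, btk

# Example usage with dummy features (2 clips, 120 audio / 30 video frames):
fusion = BottleneckFusion(dim=256)
a, v = torch.randn(2, 120, 256), torch.randn(2, 30, 256)
audio_out, video_out, summary = fusion(a, v)
```

Because every cross-modal bit must pass through the few bottleneck tokens, fusion capacity is explicitly limited, which is one plausible way to operationalize the abstract's claim of reduced modality redundancy with efficient implicit integration.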