Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition

📅 2026-01-18
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This work proposes an end-to-end, mask-free framework that couples speech enhancement with audio-visual speech recognition (AVSR) for high-noise environments. Mask-based methods filter audio noise during feature interaction and fusion, but they risk discarding semantically relevant information along with the noise. Instead of generating explicit noise masks, the proposed approach uses visual cues to implicitly purify noisy audio features before modality fusion. This is done with a Conformer-based bottleneck fusion module, which reduces modality redundancy and strengthens inter-modal interaction so that speech semantics are preserved. On the public LRS3 benchmark under noisy conditions, the method is reported to outperform prior mask-based baselines.

📝 Abstract
Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio inputs are prone to introducing adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies to filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information alongside noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement, eliminating the need for explicit noise mask generation. This framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance. By reducing modality redundancy and enhancing inter-modal interactions, our method preserves speech semantic integrity to achieve robust recognition performance. Experimental evaluations on the public LRS3 benchmark suggest that our method outperforms prior advanced mask-based baselines under noisy conditions.
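
The abstract does not spell out the internals of the bottleneck fusion module. As a rough illustration only, the sketch below shows one generic way an attention bottleneck can pass visual context to noisy audio features without computing a mask: a few bottleneck tokens first summarize the video and audio frames, and the audio frames are then updated by reading only from that summary. Everything here is an assumption made for illustration (the function names `attend` and `bottleneck_fusion`, the token counts, the parameter-free single-head attention, the residual update), and it omits the Conformer layers entirely; it is not the authors' module.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head scaled dot-product attention with no learned weights.

    Keys and values are both `memory`; a real model would apply learned
    projections first (assumption: omitted here to keep the sketch short).
    """
    dim = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        outputs.append(
            [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)]
        )
    return outputs


def bottleneck_fusion(audio, video, bottleneck):
    """Hypothetical mask-free refinement of audio frames via bottleneck tokens."""
    # Step 1: the few bottleneck tokens summarize both modalities.
    summary = attend(bottleneck, video + audio)
    # Step 2: audio frames read only from the compact summary, so cross-modal
    # information flows through a narrow channel instead of a noise mask.
    update = attend(audio, summary)
    # Residual update: audio features are adjusted, never multiplied by a mask.
    refined = [[a + u for a, u in zip(frame, upd)] for frame, upd in zip(audio, update)]
    return refined, summary


random.seed(0)
dim = 4
audio = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]       # 6 noisy audio frames
video = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]       # 3 video frames
bottleneck = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]  # 2 bottleneck tokens

refined, summary = bottleneck_fusion(audio, video, bottleneck)
print(len(refined), len(refined[0]), len(summary))  # prints: 6 4 2
```

The refined audio keeps its original shape (one vector per audio frame), while all cross-modal information is squeezed through the two summary tokens. That narrow channel is the sense in which a bottleneck can reduce modality redundancy in this sketch.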
Problem

Research questions and friction points this paper is trying to address.

audio-visual speech recognition
noise robustness
speech enhancement
feature fusion
mask-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

mask-free
audio-visual speech recognition
speech enhancement
Conformer-based fusion
noise-robust
Authors

Linzhi Wu
University of Electronic Science and Technology of China, Chengdu, China; Defense Innovation Institute, Academy of Military Sciences, Beijing, China

Xingyu Zhang
Horizon Robotics Inc
NLP&VLM&AD

Hao Yuan
Research Scientist, Meta Platforms, Inc.
Deep Learning

Yakun Zhang
Harbin Institute of Technology, Shenzhen
Software Engineering, Program Analysis, GUI Agent, Large Language Model

Changyan Zheng
High-tech Institute, Weifang, China; Defense Innovation Institute, Academy of Military Sciences, Beijing, China

Liang Xie
Wuhan University of Technology
Time Series Forecasting, Cross-modal Learning

Tiejun Liu
University of Electronic Science and Technology of China, Chengdu, China

Erwei Yin
Defense Innovation Institute, Academy of Military Sciences, Beijing, China