AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual speech recognition (AVSR) methods typically employ unidirectional or symmetric multimodal fusion, limiting their ability to model the inherent heterogeneity and complementarity between audio and visual modalities—especially under noisy conditions. To address this, we propose AD-AVSR, a novel framework featuring asymmetric dual-stream audio encoding and a closed-loop cross-modal interaction mechanism. Specifically, it introduces an audio-aware visual refinement module and a vision-guided audio denoising mask module, enabling bidirectional enhancement and selective extraction of highly correlated features. Additionally, we incorporate a threshold-driven audio-visual pair filtering strategy and end-to-end joint training. Evaluated on LRS2 and LRS3 benchmarks, AD-AVSR achieves significant improvements in both recognition accuracy and noise robustness, establishing new state-of-the-art performance. These results demonstrate its effectiveness in modeling modality heterogeneity and enhancing robustness against acoustic interference.
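The closed-loop interaction described above can be pictured as two stacked cross-attention passes plus a sigmoid denoising mask. Below is a minimal PyTorch sketch of that wiring, assuming standard multi-head cross-attention; the class names, dimensions, and mask construction are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Cross-attention: `query` features are refined under guidance
    from `context` features coming from the other modality."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        refined, _ = self.attn(query, context, context)
        return self.norm(query + refined)

class ClosedLoopEnhancer(nn.Module):
    """Closed-loop interaction: audio guides visual refinement, and the
    refined visual features drive a soft denoising mask over the audio."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.visual_refine = CrossModalBlock(d_model)  # audio-aware visual refinement
        self.audio_refine = CrossModalBlock(d_model)   # vision-guided audio context
        self.mask_head = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, time, d_model), temporally aligned
        visual_refined = self.visual_refine(visual, audio)
        audio_ctx = self.audio_refine(audio, visual_refined)
        mask = self.mask_head(audio_ctx)               # vision-guided denoising mask
        return audio * mask, visual_refined
```

Gating the original audio with a sigmoid mask, rather than replacing it outright, is one common way to let the visual stream suppress noisy audio frames while preserving clean ones.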

📝 Abstract
Audio-visual speech recognition (AVSR) combines the audio and visual modalities to improve speech recognition, especially in noisy environments. However, most existing methods adopt unidirectional enhancement or symmetric fusion, which limits their ability to capture the heterogeneous and complementary correlations of audio-visual data, especially under asymmetric information conditions. To close these gaps, we introduce a new AVSR framework, AD-AVSR, based on bidirectional modality enhancement. Specifically, we first introduce an audio dual-stream encoding strategy to enrich audio representations from multiple perspectives and to intentionally establish asymmetry in support of subsequent cross-modal interactions. The enhancement process involves two key components: an Audio-aware Visual Refinement Module, which enhances visual representations under audio guidance, and a Cross-modal Noise Suppression Masking Module, which refines audio representations using visual cues; together they form a closed-loop, bidirectional information flow. To further improve correlation robustness, we adopt a threshold-based selection mechanism to filter out irrelevant or weakly correlated audio-visual pairs. Extensive experimental results on the LRS2 and LRS3 datasets show that AD-AVSR consistently surpasses SOTA methods in both performance and noise robustness, highlighting the effectiveness of our model design.
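As a concrete reading of the "audio dual-stream encoding strategy", here is a hedged PyTorch sketch in which the same audio frames are encoded by two deliberately different branches so the resulting views stay asymmetric. The branch choices (pointwise vs. temporal-convolutional), input features, and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DualStreamAudioEncoder(nn.Module):
    """Encodes the same audio through two deliberately different branches
    so the two views remain asymmetric for later cross-modal interaction."""
    def __init__(self, in_dim: int = 80, d_model: int = 256):
        super().__init__()
        self.stream_a = nn.Sequential(nn.Linear(in_dim, d_model), nn.GELU())
        self.stream_b = nn.Sequential(
            nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1), nn.GELU()
        )

    def forward(self, audio: torch.Tensor):
        # audio: (batch, time, in_dim), e.g. log-mel filterbank frames
        feat_a = self.stream_a(audio)                                  # framewise view
        feat_b = self.stream_b(audio.transpose(1, 2)).transpose(1, 2)  # local-context view
        return feat_a, feat_b

# Usage: two (2, 100, 256) views of the same 80-dim filterbank input.
enc = DualStreamAudioEncoder()
feat_a, feat_b = enc(torch.randn(2, 100, 80))
```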
Problem

Research questions and friction points this paper is trying to address.

Enhances audio-visual speech recognition in noisy environments
Addresses limitations of unidirectional and symmetric fusion methods
Improves robustness by filtering weakly correlated audio-visual pairs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional modality enhancement for AVSR
Audio dual-stream encoding for multi-perspective audio representations
Threshold-based selection for correlation robustness (a sketch follows this list)
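The paper's abstract does not spell out the selection criterion. One plausible sketch of a threshold-based filter compares pooled audio and visual features with cosine similarity; the pooling scheme, the similarity measure, and the threshold value below are all assumptions, not the authors' method.

```python
import torch
import torch.nn.functional as F

def filter_av_pairs(audio_feat: torch.Tensor,
                    visual_feat: torch.Tensor,
                    tau: float = 0.5):
    """Keep only audio-visual pairs whose pooled features exceed the
    cosine-similarity threshold `tau`; weakly correlated pairs are dropped.
    Both inputs are (batch, time, dim) and temporally aligned."""
    a = F.normalize(audio_feat.mean(dim=1), dim=-1)   # (B, D) pooled audio
    v = F.normalize(visual_feat.mean(dim=1), dim=-1)  # (B, D) pooled video
    sim = (a * v).sum(dim=-1)                         # per-pair cosine similarity
    keep = sim >= tau                                 # threshold-driven selection
    return keep, audio_feat[keep], visual_feat[keep]
```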
Junxiao Xue
Zhejiang Lab
Computer Graphics · Crowd Simulation · Multi-agent Modeling · Multi-modal Learning
Xiaozhen Liu
Zhengzhou University
Computer Vision · Multimodal Learning
Xuecheng Wu
Xi’an Jiaotong University
Xinyi Yin
Zhengzhou University
Danlei Huang
Xi’an Jiaotong University
Fei Yu
Zhejiang Lab