🤖 AI Summary
AV-TSE aims to accurately extract target speech from multi-talker mixtures using visual cues, but existing methods rely heavily on lip-speech synchronization while neglecting explicit modeling of interfering speakers and background noise, which limits robustness. This paper proposes a reverse selective auditory attention mechanism that explicitly estimates and suppresses noise components, and designs a subtraction-and-extraction framework (SEANet) built on audio-visual fusion and noise-inversion modeling. Evaluated against three re-implemented baselines on five standard benchmarks, SEANet achieves state-of-the-art performance across all nine metrics, significantly improving separation accuracy and generalization under complex acoustic conditions. The code, pre-trained models, and experiment logs will be publicly released.
📝 Abstract
Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from an audio mixture given auxiliary visual cues. Previous methods usually locate the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of the target speech while ignoring variations in the noise characteristics, which may result in extracting noisy signals from an incorrect sound source in challenging acoustic situations. To this end, we propose a novel reverse selective auditory attention mechanism that suppresses interfering speakers and non-speech signals to avoid incorrect speaker extraction. By estimating and utilizing the undesired noisy signal through this mechanism, we design an AV-TSE framework named the Subtraction-and-ExtrAction network (SEANet) to suppress noisy signals. We conduct extensive experiments, re-implementing three popular AV-TSE methods as baselines and evaluating with nine metrics. The results show that our proposed SEANet achieves state-of-the-art performance on all five datasets. We will release the code, models, and data logs.
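The core idea — estimate the undesired signal first, then subtract it from the mixture before extracting the target — can be illustrated with a minimal toy sketch. This is not the authors' implementation; the function name, the oracle noise estimate, and the pass-through extraction stage are all illustrative placeholders standing in for learned networks.

```python
import numpy as np

def subtract_and_extract(mixture, noise_estimate, extract_gain=1.0):
    """Hypothetical two-stage pipeline: first subtract the estimated
    undesired signal (subtraction stage), then refine the residual
    (extraction stage, here just a pass-through gain)."""
    residual = mixture - noise_estimate   # suppress interference/noise
    extracted = extract_gain * residual   # placeholder for refinement
    return extracted

# Toy demo: a mixture of a target tone and interfering noise.
t = np.linspace(0.0, 1.0, 8000)
target = np.sin(2 * np.pi * 220 * t)
noise = 0.5 * np.random.randn(t.size)
mixture = target + noise

# Assume an oracle noise estimate purely for illustration; in the paper
# this estimate would come from the learned reverse-attention mechanism.
extracted = subtract_and_extract(mixture, noise)
print(np.allclose(extracted, target))  # → True
```

With a perfect noise estimate the subtraction recovers the target exactly; the interesting part of the actual method is learning that estimate from audio-visual cues.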