🤖 AI Summary
To address the limited performance of audio-visual speaker extraction in multi-speaker, co-located scenarios, this paper proposes a plug-and-play cross-speaker attention mechanism, the first to explicitly model collaborative speaking-activity cues among multiple co-occurring faces within an audio-visual speaker extraction framework. The method builds on the AV-DPRNN and AV-TFGridNet architectures, adding a lightweight module that fuses spatiotemporal visual features with frequency-domain audio representations and supports an arbitrary number of input faces. Evaluated on four benchmarks (VoxCeleb2, MISP, LRS2, and LRS3), the approach consistently outperforms baseline methods. Notably, it achieves 1.8–2.3 dB SI-SNRi improvements under both highly overlapped and sparsely overlapped conditions, demonstrating strong effectiveness and generalization across diverse multi-talker settings.
📝 Abstract
Audio-visual speaker extraction isolates a target speaker's speech from a speech mixture, conditioned on a visual cue, typically a recording of the target speaker's face. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker-activity cues about the scene. In this work, we introduce a plug-and-play inter-speaker attention module that processes a flexible number of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and the sparsely overlapped MISP, demonstrate that our approach consistently outperforms the baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
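To make the mechanism concrete, below is a minimal PyTorch sketch of an inter-speaker attention block under simplifying assumptions: the target speaker's visual stream queries the visual streams of however many co-occurring faces are on-screen, and the attended context is fused back residually so the block can be dropped into an existing extractor. The class name `InterSpeakerAttention`, the head count, and the residual fusion are illustrative choices, not the paper's exact design.

```python
# Minimal sketch of an inter-speaker attention block (illustrative
# names and layer choices; not the authors' exact implementation).
import torch
import torch.nn as nn


class InterSpeakerAttention(nn.Module):
    """Lets the target speaker's visual stream attend over the visual
    streams of an arbitrary number of co-occurring faces."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, others: torch.Tensor) -> torch.Tensor:
        # target: (batch, time, dim)          embedding of the target face
        # others: (batch, n_faces, time, dim) embeddings of co-occurring faces
        b, n, t, d = others.shape
        # Flatten the face axis into the key/value sequence so the module
        # accepts any number of on-screen faces.
        kv = others.reshape(b, n * t, d)
        ctx, _ = self.attn(query=target, key=kv, value=kv)
        # Residual fusion keeps the block plug-and-play: the enclosing
        # extractor sees the same tensor shape whether or not extra
        # faces are available.
        return self.norm(target + ctx)


if __name__ == "__main__":
    block = InterSpeakerAttention(dim=256)
    target = torch.randn(2, 50, 256)     # 2 clips, 50 video frames
    others = torch.randn(2, 3, 50, 256)  # 3 co-occurring faces per clip
    print(block(target, others).shape)   # torch.Size([2, 50, 256])
```

In the paper's setting, the fused visual representation would then condition the audio separator (e.g. AV-DPRNN or AV-TFGridNet) in the same way a single-face cue would; the exact fusion point and feature dimensions are assumptions here.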