Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction

📅 2025-05-27
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address the limited performance of audio-visual speaker extraction in multi-speaker co-located scenarios, this paper proposes a plug-and-play inter-speaker attention mechanism, the first to explicitly model collaborative speaking-activity cues among multiple co-occurring faces within an audio-visual speech extraction framework. The method builds on the AV-DPRNN and AV-TFGridNet architectures, adding a lightweight module that fuses spatiotemporal visual features with the audio representations and supports an arbitrary number of input faces. Evaluated on four benchmarks (VoxCeleb2, MISP, LRS2, and LRS3), the approach consistently outperforms baseline methods, achieving 1.8–2.3 dB improvements in SI-SNRi under both high-overlap and sparse-overlap conditions and demonstrating strong generalization across diverse multi-talker settings.
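
The page carries no code, but the core idea, attending from the target speaker's visual stream to a variable number of co-occurring face streams, can be illustrated. Below is a minimal PyTorch sketch assuming per-frame visual embeddings of shape (batch, time, dim); the class name `InterSpeakerAttention`, the dimensions, and the residual fusion are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of an inter-speaker attention module; shapes and names
# are assumptions, not the paper's exact code.
import torch
import torch.nn as nn

class InterSpeakerAttention(nn.Module):
    """Attends from the target speaker's visual stream to an arbitrary
    number of co-occurring face streams, frame by frame."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, others: torch.Tensor) -> torch.Tensor:
        # target: (B, T, D) visual embedding of the target face
        # others: (B, N, T, D) embeddings of N co-occurring faces (N may vary)
        B, N, T, D = others.shape
        # Fold time into the batch so attention runs over the N faces per frame.
        q = target.reshape(B * T, 1, D)                       # (B*T, 1, D)
        kv = others.permute(0, 2, 1, 3).reshape(B * T, N, D)  # (B*T, N, D)
        ctx, _ = self.attn(q, kv, kv)                         # (B*T, 1, D)
        ctx = ctx.reshape(B, T, D)
        # Residual fusion keeps the module plug-and-play: with no useful
        # co-occurring cues, the target stream passes through largely intact.
        return self.norm(target + ctx)

# Usage: 3 co-occurring faces; N can change freely between examples.
m = InterSpeakerAttention(dim=256, num_heads=4)
tgt = torch.randn(2, 50, 256)        # batch of 2, 50 video frames
others = torch.randn(2, 3, 50, 256)  # 3 extra on-screen faces
out = m(tgt, others)                 # (2, 50, 256)
```

Because the attention key/value axis is the face dimension, the module is indifferent to how many co-occurring faces appear, which is what makes it applicable across scenes with different numbers of on-screen speakers.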

📝 Abstract
Audio-visual speaker extraction isolates a target speaker's speech from a speech mixture, conditioned on a visual cue, typically a recording of the target speaker's face. In real-world scenarios, however, other co-occurring faces are often present on-screen, providing valuable speaker-activity cues. In this work, we introduce a plug-and-play inter-speaker attention module that processes a flexible number of co-occurring faces, allowing more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and sparsely overlapped MISP, demonstrate that our approach consistently outperforms baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
Problem

Research questions and friction points this paper is trying to address.

Extracting target speaker speech using visual cues
Handling co-occurring faces in multi-person environments
Improving robustness in audio-visual speaker extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play inter-speaker attention module
Processes flexible co-occurring face cues
Integrates with AV-DPRNN and AV-TFGridNet (see the integration sketch after this list)
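
The "plug-and-play" claim amounts to the module's output dropping in where a single-face visual cue would normally be fused with the audio stream. Here is a hypothetical sketch of that fusion step, assuming time-aligned audio and visual features; `VisualAudioFusion` and the concatenate-then-project scheme are assumptions standing in for the AV-DPRNN / AV-TFGridNet fusion layers, not their actual code.

```python
# A hypothetical fusion step; the attention-enhanced visual cue replaces the
# usual single-face cue. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class VisualAudioFusion(nn.Module):
    """Concatenate-and-project fusion of audio features with the
    attention-enhanced visual cue (a common AV fusion pattern)."""

    def __init__(self, audio_dim: int = 256, visual_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(audio_dim + visual_dim, audio_dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (B, T, Da) frame-level audio features from the backbone
        # visual: (B, T, Dv) InterSpeakerAttention output, upsampled to the
        #         audio frame rate beforehand (e.g. by interpolation)
        return self.proj(torch.cat([audio, visual], dim=-1))

# Usage with the module sketched earlier: co-occurring-face context is folded
# into the visual cue before the backbone's standard fusion step.
fuse = VisualAudioFusion()
audio = torch.randn(2, 50, 256)
visual = torch.randn(2, 50, 256)  # e.g. InterSpeakerAttention output
fused = fuse(audio, visual)       # (2, 50, 256), passed on to the separator
```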
👥 Authors
Zexu Pan
Tongyi Lab, Alibaba Group, Singapore
Shengkui Zhao
Senior Algorithm Expert, Alibaba Group
Speech processing and large models
Tingting Wang
Nanjing University of Posts and Telecommunications, Nanjing, China
Kun Zhou
Tongyi Lab, Alibaba Group, Singapore
Yukun Ma
Alibaba Group
ASR, SLU
Chong Zhang
Tongyi Lab, Alibaba Group, Singapore
Bin Ma
Tongyi Lab, Alibaba Group, Singapore