🤖 AI Summary
To address the limited performance of audio-visual speaker extraction in multi-speaker, co-located scenarios, this paper proposes a plug-and-play cross-speaker attention mechanism, the first to explicitly model collaborative speaking-activity cues among multiple co-occurring faces within an audio-visual speaker extraction framework. The method builds on the AV-DPRNN and AV-TFGridNet architectures, adding a lightweight module that fuses spatiotemporal visual features with frequency-domain audio representations and supports an arbitrary number of input faces. Evaluated on four benchmarks (VoxCeleb2, MISP, LRS2, and LRS3), the approach consistently outperforms baseline methods. Notably, it achieves 1.8–2.3 dB SI-SNRi improvements under both highly overlapped and sparsely overlapped conditions, demonstrating strong effectiveness and generalization across diverse multi-talker settings.
📝 Abstract
Audio-visual speaker extraction isolates a target speaker's speech from a speech mixture, conditioned on a visual cue, typically a recording of the target speaker's face. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker-activity cues about the scene. In this work, we introduce a plug-and-play inter-speaker attention module that processes a flexible number of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and the sparsely overlapped MISP, demonstrate that our approach consistently outperforms the baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
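To make the mechanism concrete, below is a minimal PyTorch sketch of an inter-speaker attention block under simplifying assumptions: the target speaker's visual stream queries the visual streams of however many co-occurring faces are on-screen, and the attended context is fused back residually so the block can be dropped into an existing extractor. The class name `InterSpeakerAttention`, the head count, and the residual fusion are illustrative choices, not the paper's exact design.

```python
# Minimal sketch of an inter-speaker attention block (illustrative
# names and layer choices; not the authors' exact implementation).
import torch
import torch.nn as nn


class InterSpeakerAttention(nn.Module):
    """Lets the target speaker's visual stream attend over the visual
    streams of an arbitrary number of co-occurring faces."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, others: torch.Tensor) -> torch.Tensor:
        # target: (batch, time, dim)          embedding of the target face
        # others: (batch, n_faces, time, dim) embeddings of co-occurring faces
        b, n, t, d = others.shape
        # Flatten the face axis into the key/value sequence so the module
        # accepts any number of on-screen faces.
        kv = others.reshape(b, n * t, d)
        ctx, _ = self.attn(query=target, key=kv, value=kv)
        # Residual fusion keeps the block plug-and-play: the enclosing
        # extractor sees the same tensor shape whether or not extra
        # faces are available.
        return self.norm(target + ctx)


if __name__ == "__main__":
    block = InterSpeakerAttention(dim=256)
    target = torch.randn(2, 50, 256)     # 2 clips, 50 video frames
    others = torch.randn(2, 3, 50, 256)  # 3 co-occurring faces per clip
    print(block(target, others).shape)   # torch.Size([2, 50, 256])
```

In the paper's setting, the fused visual representation would then condition the audio separator (e.g. AV-DPRNN or AV-TFGridNet) in the same way a single-face cue would; the exact fusion point and feature dimensions are assumptions here.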