🤖 AI Summary
To address computational redundancy and performance bottlenecks in active speaker detection (ASD) for multi-speaker audio-visual scenes, this paper proposes a decoupled dual-stream Transformer architecture: a temporal interaction stream models inter-frame dynamics, while a speaker interaction stream captures intra-frame person-level relationships; cross-modal attention aligns audio-visual features with the interaction representations. A lightweight Voice Gate module suppresses interference from non-speech facial motion, and the design establishes, for the first time, an explicit decoupling of temporal and speaker interactions. Evaluated on AVA-ActiveSpeaker, the method achieves 95.6% mAP, a new state of the art, while cutting computation by 80% relative to GNN-based approaches and using 30% fewer parameters than attention-based alternatives. It also generalizes well to Columbia ASD.
📝 Abstract
Audio-visual speaker detection aims to identify the active speaker in videos by leveraging complementary audio and visual cues. Existing methods often suffer from computational inefficiency or suboptimal performance due to joint modeling of temporal and speaker interactions. We propose D$^{2}$Stream, a decoupled dual-stream framework that separates cross-frame temporal modeling from within-frame speaker discrimination. Audio and visual features are first aligned via cross-modal attention, then fed into two lightweight streams: a Temporal Interaction Stream captures long-range temporal dependencies, while a Speaker Interaction Stream models per-frame inter-person relationships. The temporal and relational features extracted by the two streams interact via cross-attention to enrich the representations. A lightweight Voice Gate module further mitigates false positives from non-speech facial movements. On AVA-ActiveSpeaker, D$^{2}$Stream achieves a new state of the art at 95.6% mAP, with an 80% reduction in computation compared to GNN-based models and 30% fewer parameters than attention-based alternatives, while also generalizing well on Columbia ASD. Source code is available at https://anonymous.4open.science/r/D2STREAM.
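The abstract gives no implementation details, but the core decoupling idea can be sketched in a few lines. The intuition: for fused audio-visual features of shape (S speakers, T frames, D dims), joint modeling attends over all S·T tokens at once, costing O((S·T)²·D) per layer, while the decoupled streams attend over the T axis per speaker and the S axis per frame, costing O(S·T²·D + T·S²·D), which is consistent with the claimed compute reduction. The minimal single-head attention below and the Voice Gate form (a sigmoid gate from audio features multiplying the fused representation) are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention over the second-to-last axis.
    x: (..., L, D) -> (..., L, D). Projections omitted for brevity."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ x

# Hypothetical fused audio-visual features: S speakers, T frames, D dims
S, T, D = 3, 8, 16
rng = np.random.default_rng(0)
feats = rng.standard_normal((S, T, D))

# Temporal Interaction Stream: attend across frames, independently per speaker
temporal_out = self_attention(feats)                                   # (S, T, D)

# Speaker Interaction Stream: attend across speakers, independently per frame
speaker_out = np.swapaxes(self_attention(np.swapaxes(feats, 0, 1)), 0, 1)  # (S, T, D)

# Voice Gate (assumed form): a sigmoid gate driven by audio evidence
# down-weights features for speakers/frames with no speech signal
audio_logits = rng.standard_normal((S, T, 1))
gate = 1.0 / (1.0 + np.exp(-audio_logits))   # in (0, 1)
gated = gate * (temporal_out + speaker_out)  # suppressed where gate -> 0
```

In a real implementation the two streams would be full Transformer blocks with learned projections and the cross-attention interaction between them; the sketch only demonstrates the axis-wise factorization that gives the decoupling its cost advantage.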