D$^{2}$Stream: Decoupled Dual-Stream Temporal-Speaker Interaction for Audio-Visual Speaker Detection

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address computational redundancy and performance bottlenecks in active speaker detection (ASD) for multi-speaker audio-visual scenes, this paper proposes a decoupled dual-stream Transformer architecture: a temporal interaction stream models inter-frame dynamics, while a speaker interaction stream captures intra-frame person-level relationships, with cross-modal attention aligning audio-visual features before the two streams interact. The authors also introduce a lightweight Voice Gate module that suppresses interference from non-speech facial motion, and present the first explicit decoupling of temporal and speaker interactions in ASD. On AVA-ActiveSpeaker, the method achieves 95.6% mAP, a new state of the art, while cutting computational cost by 80% relative to GNN-based approaches and parameter count by 30% relative to attention-based alternatives. It also generalizes well to Columbia ASD.

📝 Abstract
Audio-visual speaker detection aims to identify the active speaker in videos by leveraging complementary audio and visual cues. Existing methods often suffer from computational inefficiency or suboptimal performance due to joint modeling of temporal and speaker interactions. We propose D$^{2}$Stream, a decoupled dual-stream framework that separates cross-frame temporal modeling from within-frame speaker discrimination. Audio and visual features are first aligned via cross-modal attention, then fed into two lightweight streams: a Temporal Interaction Stream captures long-range temporal dependencies, while a Speaker Interaction Stream models per-frame inter-person relationships. The temporal and relational features extracted by the two streams interact via cross-attention to enrich representations. A lightweight Voice Gate module further mitigates false positives from non-speech facial movements. On AVA-ActiveSpeaker, D$^{2}$Stream achieves a new state-of-the-art at 95.6% mAP, with 80% reduction in computation compared to GNN-based models and 30% fewer parameters than attention-based alternatives, while also generalizing well on Columbia ASD. Source code is available at https://anonymous.4open.science/r/D2STREAM.
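The decoupling described in the abstract can be sketched with plain attention operations: the same feature tensor of shape (frames, speakers, dim) is attended over the frame axis in the Temporal Interaction Stream and over the speaker axis in the Speaker Interaction Stream, after which the two streams exchange information via cross-attention. The sketch below is an illustration under our own simplifying assumptions (single-head, unprojected scaled dot-product attention in NumPy), not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def dual_stream(feats):
    """feats: (T, S, D) audio-visually aligned features.

    Temporal stream: attend across frames, independently per speaker.
    Speaker stream:  attend across speakers, independently per frame.
    """
    # Temporal stream: move the time axis into attention position.
    per_speaker = feats.transpose(1, 0, 2)            # (S, T, D)
    temporal = attention(per_speaker, per_speaker,
                         per_speaker).transpose(1, 0, 2)  # back to (T, S, D)
    # Speaker stream: speakers within each frame attend to each other.
    speaker = attention(feats, feats, feats)          # (T, S, D)
    # Cross-attention lets the two streams enrich each other
    # (here: temporal queries attend to speaker-stream keys/values).
    fused = attention(temporal, speaker, speaker)
    return fused

feats = np.random.default_rng(0).normal(size=(8, 3, 16))
out = dual_stream(feats)
print(out.shape)  # (8, 3, 16)
```

The efficiency claim follows from this factorization: attending over T frames and S speakers separately costs O(T² + S²) per token rather than the O(T²S²) of joint modeling over all frame-speaker pairs.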
Problem

Research questions and friction points this paper is trying to address.

Joint modeling of temporal and speaker interactions is computationally inefficient or yields suboptimal accuracy
Computational cost must be reduced without sacrificing detection accuracy
Non-speech facial movements trigger false-positive speaker detections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples temporal and speaker interaction streams
Uses cross-modal attention for audio-visual feature alignment
Employs lightweight Voice Gate to reduce false positives
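The Voice Gate bullet above can be illustrated as a simple multiplicative gate: a scalar in [0, 1] computed from the audio feature scales the visual speaking score, so mouth motion without accompanying speech evidence is suppressed. This is a minimal sketch under our own assumptions (a sigmoid over a hypothetical learned audio projection `w`, `b`); the paper's actual module may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def voice_gate(audio_feat, visual_score, w, b):
    """Scale the visual speaking score by a speech-presence gate in [0, 1]."""
    gate = sigmoid(audio_feat @ w + b)  # scalar gate for this frame
    return gate * visual_score

rng = np.random.default_rng(1)
w, b = rng.normal(size=4), 0.0        # hypothetical learned parameters
silent_audio = np.zeros(4)            # no speech energy in the audio feature
talking_audio = 5.0 * w               # feature strongly aligned with the gate
score = 0.9                           # visual cue alone says "speaking"

print(voice_gate(silent_audio, score, w, b))   # 0.45: gate = sigmoid(0) = 0.5
print(voice_gate(talking_audio, score, w, b))  # larger: the gate opens
```

A false positive such as a chewing or laughing face would land in the `silent_audio` case: the visual score is high, but the gate halves it or worse, whereas genuine speech leaves the score nearly untouched.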
Junhao Xiao, Central China Normal University
Shun Feng, Central China Normal University
Zhiyu Wu, DeepSeek-AI, Peking University
Jianjun Li, Professor
Zhiyuan Ma, Huazhong University of Science and Technology
Yi Chen, Central China Normal University