D$^{2}$Stream: Decoupled Dual-Stream Temporal-Speaker Interaction for Audio-Visual Speaker Detection

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address computational redundancy and performance bottlenecks in active speaker detection (ASD) for multi-speaker audio-visual scenes, this paper proposes a decoupled dual-stream Transformer architecture: a temporal interaction stream models inter-frame dynamics, while a speaker interaction stream captures intra-frame person-level relationships, with cross-modal attention aligning audio-visual features before the two streams interact. The authors also introduce a lightweight Voice Gate module that suppresses interference from non-speech facial motion, and present the first explicit decoupling of temporal and speaker interactions in ASD. On AVA-ActiveSpeaker, the method achieves 95.6% mAP, a new state of the art, while cutting computational cost by 80% relative to GNN-based approaches and parameter count by 30% relative to attention-based alternatives. It also generalizes well to Columbia ASD.

📝 Abstract
Audio-visual speaker detection aims to identify the active speaker in videos by leveraging complementary audio and visual cues. Existing methods often suffer from computational inefficiency or suboptimal performance due to joint modeling of temporal and speaker interactions. We propose D$^{2}$Stream, a decoupled dual-stream framework that separates cross-frame temporal modeling from within-frame speaker discrimination. Audio and visual features are first aligned via cross-modal attention, then fed into two lightweight streams: a Temporal Interaction Stream captures long-range temporal dependencies, while a Speaker Interaction Stream models per-frame inter-person relationships. The temporal and relational features extracted by the two streams interact via cross-attention to enrich representations. A lightweight Voice Gate module further mitigates false positives from non-speech facial movements. On AVA-ActiveSpeaker, D$^{2}$Stream achieves a new state-of-the-art at 95.6% mAP, with 80% reduction in computation compared to GNN-based models and 30% fewer parameters than attention-based alternatives, while also generalizing well on Columbia ASD. Source code is available at https://anonymous.4open.science/r/D2STREAM.
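The decoupling described in the abstract can be sketched with plain attention operations: the same feature tensor of shape (frames, speakers, dim) is attended over the frame axis in the Temporal Interaction Stream and over the speaker axis in the Speaker Interaction Stream, after which the two streams exchange information via cross-attention. The sketch below is an illustration under our own simplifying assumptions (single-head, unprojected scaled dot-product attention in NumPy), not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def dual_stream(feats):
    """feats: (T, S, D) audio-visually aligned features.

    Temporal stream: attend across frames, independently per speaker.
    Speaker stream:  attend across speakers, independently per frame.
    """
    # Temporal stream: move the time axis into attention position.
    per_speaker = feats.transpose(1, 0, 2)            # (S, T, D)
    temporal = attention(per_speaker, per_speaker,
                         per_speaker).transpose(1, 0, 2)  # back to (T, S, D)
    # Speaker stream: speakers within each frame attend to each other.
    speaker = attention(feats, feats, feats)          # (T, S, D)
    # Cross-attention lets the two streams enrich each other
    # (here: temporal queries attend to speaker-stream keys/values).
    fused = attention(temporal, speaker, speaker)
    return fused

feats = np.random.default_rng(0).normal(size=(8, 3, 16))
out = dual_stream(feats)
print(out.shape)  # (8, 3, 16)
```

The efficiency claim follows from this factorization: attending over T frames and S speakers separately costs O(T² + S²) per token rather than the O(T²S²) of joint modeling over all frame-speaker pairs.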
Problem

Research questions and friction points this paper is trying to address.

Joint modeling of temporal and speaker interactions is computationally inefficient or yields suboptimal accuracy
Computational cost must be reduced without sacrificing detection accuracy
Non-speech facial movements trigger false-positive speaker detections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples temporal and speaker interaction streams
Uses cross-modal attention for audio-visual feature alignment
Employs lightweight Voice Gate to reduce false positives
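The Voice Gate bullet above can be illustrated as a simple multiplicative gate: a scalar in [0, 1] computed from the audio feature scales the visual speaking score, so mouth motion without accompanying speech evidence is suppressed. This is a minimal sketch under our own assumptions (a sigmoid over a hypothetical learned audio projection `w`, `b`); the paper's actual module may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def voice_gate(audio_feat, visual_score, w, b):
    """Scale the visual speaking score by a speech-presence gate in [0, 1]."""
    gate = sigmoid(audio_feat @ w + b)  # scalar gate for this frame
    return gate * visual_score

rng = np.random.default_rng(1)
w, b = rng.normal(size=4), 0.0        # hypothetical learned parameters
silent_audio = np.zeros(4)            # no speech energy in the audio feature
talking_audio = 5.0 * w               # feature strongly aligned with the gate
score = 0.9                           # visual cue alone says "speaking"

print(voice_gate(silent_audio, score, w, b))   # 0.45: gate = sigmoid(0) = 0.5
print(voice_gate(talking_audio, score, w, b))  # larger: the gate opens
```

A false positive such as a chewing or laughing face would land in the `silent_audio` case: the visual score is high, but the gate halves it or worse, whereas genuine speech leaves the score nearly untouched.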
Junhao Xiao, Central China Normal University
Shun Feng, Central China Normal University
Zhiyu Wu, DeepSeek-AI, Peking University
Jianjun Li, Professor
Zhiyuan Ma, Huazhong University of Science and Technology
Yi Chen, Central China Normal University