🤖 AI Summary
This work addresses speaker diarization (speech activity detection plus speaker attribution) in meeting scenarios where multi-channel training data are unavailable and the number and placement of microphones are unknown. We propose a pipeline that couples TDOA-driven speech segmentation with acoustic-embedding clustering for speaker attribution. The method requires no prior knowledge about the microphones and supports both compact arrays and distributed microphone setups. To handle overlapping speech and speaker movement, it combines spatial and spectral cues: segmentation is guided by TDOA estimation, while spectral clustering of speaker embeddings keeps speaker identities consistent across positions. Experiments show that the approach significantly outperforms the single-channel pyannote baseline under both microphone configurations, with clear gains in overlap-aware segmentation accuracy and speaker-identity stability.
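As a rough illustration of the spatial cue that drives segmentation, the sketch below estimates a time difference of arrival between two microphone channels with GCC-PHAT cross-correlation. This is a generic, minimal example, not the authors' implementation; the function name, parameters, and PHAT weighting are assumptions for illustration.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None, interp=1):
    """Estimate the TDOA (in seconds) of signal y relative to x
    using GCC-PHAT (generalized cross-correlation, phase transform)."""
    n = x.shape[0] + y.shape[0]
    # Cross-power spectrum with PHAT weighting (unit magnitude per bin).
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12
    # Back to the time domain: peak position gives the relative delay.
    cc = np.fft.irfft(R, n=n * interp)
    max_shift = n * interp // 2
    if max_tau is not None:
        # Restrict the search to physically plausible delays.
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)

# Hypothetical usage: tau = gcc_phat(mic1_frame, mic2_frame, fs=16000, max_tau=0.01)
```

In a diarization front end, such per-frame TDOA estimates can be tracked over time so that changes in the dominant delay pattern mark speaker-turn boundaries, including during overlap.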
📝 Abstract
We propose a spatio-spectral diarization pipeline that combines model-based and data-driven components: TDOA-based segmentation followed by embedding-based clustering. The proposed system requires neither access to multi-channel training data nor prior knowledge about the number or placement of microphones. It works for both a compact microphone array and distributed microphones, with minor adjustments. Due to its superior handling of overlapping speech during segmentation, the proposed pipeline significantly outperforms the single-channel pyannote approach, both in a scenario with a compact microphone array and in a setup with distributed microphones. Additionally, we show that, unlike fully spatial diarization pipelines, the proposed system can correctly track speakers when they change positions.
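To make the second stage concrete, here is a minimal sketch of attributing speaker identities to TDOA-derived segments by spectral clustering of their speaker embeddings (e.g. x-vector or ECAPA-style vectors). The use of scikit-learn, a cosine-similarity affinity, and a known speaker count are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_segments(embeddings, n_speakers):
    """Assign a speaker label to each TDOA-derived segment.

    embeddings : (n_segments, dim) array, one speaker embedding per segment.
    n_speakers : assumed number of speakers in the meeting.
    Returns an array of integer speaker labels, one per segment.
    """
    # Cosine-similarity affinity between L2-normalized segment embeddings.
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(norm @ norm.T, 0.0, 1.0)
    labels = SpectralClustering(
        n_clusters=n_speakers,
        affinity="precomputed",
        assign_labels="kmeans",
        random_state=0,
    ).fit_predict(affinity)
    return labels

# Hypothetical usage: labels = cluster_segments(segment_embeddings, n_speakers=4)
```

Because the labels come from spectral rather than spatial similarity, a speaker who moves to a new position still maps to the same cluster, which is what allows the pipeline to track speakers across position changes.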