Spatio-spectral diarization of meetings by combining TDOA-based segmentation and speaker embedding-based clustering

📅 2025-06-19
🤖 AI Summary
This work addresses speaker diarization in meeting scenarios where multi-channel training data are unavailable and the number and placement of microphones are unknown. It proposes a spatio-spectral pipeline that combines model-based and data-driven components: TDOA-based segmentation, followed by clustering of speaker embeddings to assign speaker identities. The system needs no prior knowledge of the microphone setup and works with both compact arrays and distributed microphones, with minor adjustments. Because the segmentation handles overlapping speech well, the pipeline significantly outperforms the single-channel pyannote approach in both microphone configurations. Unlike fully spatial diarization pipelines, it also tracks speakers correctly when they change position.

📝 Abstract
We propose a spatio-spectral, combined model-based and data-driven diarization pipeline consisting of TDOA-based segmentation followed by embedding-based clustering. The proposed system requires neither access to multi-channel training data nor prior knowledge about the number or placement of microphones. It works for both a compact microphone array and distributed microphones, with minor adjustments. Due to its superior handling of overlapping speech during segmentation, the proposed pipeline significantly outperforms the single-channel pyannote approach, both in a scenario with a compact microphone array and in a setup with distributed microphones. Additionally, we show that, unlike fully spatial diarization pipelines, the proposed system can correctly track speakers when they change positions.
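To make the spatial half of the pipeline concrete, the sketch below estimates a time difference of arrival (TDOA) between two microphone signals by picking the lag that maximizes their cross-correlation. This is only an illustration: the abstract does not say which TDOA estimator the authors use (generalized cross-correlation variants are a common choice in practice), and the function name `estimate_tdoa`, the `max_lag` parameter, and the synthetic signals are assumptions made for this example.

```python
import random


def estimate_tdoa(ref, other, fs, max_lag):
    """Return the delay of `other` relative to `ref` in seconds.

    Positive values mean the sound reaches `other` later than `ref`.
    Hypothetical helper: plain time-domain cross-correlation, not
    necessarily the estimator used in the paper.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate ref[n] with other[n + lag] over the valid overlap.
        start = max(0, -lag)
        stop = min(len(ref), len(other) - lag)
        score = sum(ref[n] * other[n + lag] for n in range(start, stop))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / fs


# Synthetic test signal: white noise that reaches the second
# microphone 5 samples later than the first (made-up values).
random.seed(0)
fs = 16000
ref = [random.gauss(0.0, 1.0) for _ in range(2000)]
delay = 5
other = [0.0] * delay + ref[:-delay]

tdoa = estimate_tdoa(ref, other, fs, max_lag=20)
print(f"estimated TDOA: {tdoa * 1e3:.4f} ms")
```

Speakers at different positions produce different TDOAs, which is what lets a TDOA-based segmentation separate simultaneously active speakers even when their speech overlaps in time.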
Problem

Research questions and friction points this paper is trying to address.

Meeting diarization that uses both spatial (TDOA) and spectral (speaker embedding) information
Works without multi-channel training data or knowledge of the number or placement of microphones
Handles overlapping speech and speaker position changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

TDOA-based segmentation for spatio-spectral diarization
Embedding-based clustering without multi-channel training data
Works with both compact microphone arrays and distributed microphones
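The spectral half of the pipeline can be illustrated in the same spirit. The sketch below groups segments by the cosine similarity of their speaker embeddings, so that a speaker who moves (and therefore appears with a new TDOA) still receives the same label. The clustering algorithm, the threshold, and the three-dimensional toy embeddings are all assumptions for illustration; the source does not specify which embedding extractor or clustering method is used.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_segments(embeddings, threshold=0.7):
    """Greedy centroid clustering; returns one speaker label per segment.

    Hypothetical helper: a segment joins the most similar existing
    cluster if the similarity reaches `threshold`, else starts a new one.
    """
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = sims.index(max(sims))
            counts[k] += 1
            # Running mean keeps the centroid representative of its members.
            centroids[k] = [c + (e - c) / counts[k]
                            for c, e in zip(centroids[k], emb)]
        else:
            k = len(centroids)
            centroids.append(list(emb))
            counts.append(1)
        labels.append(k)
    return labels


# Made-up segments: the first and third come from the same voice,
# but the TDOA differs because the speaker changed position.
segments = [
    {"tdoa": 0.0003, "emb": [0.90, 0.10, 0.00]},
    {"tdoa": -0.0002, "emb": [0.10, 0.95, 0.05]},
    {"tdoa": -0.0004, "emb": [0.88, 0.15, 0.02]},
]
labels = cluster_segments([s["emb"] for s in segments])
print(labels)
```

A purely spatial system would treat the first and third segments as different speakers because their TDOAs differ; clustering on the embeddings is what keeps the identity consistent across positions.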
Authors

Tobias Cord-Landwehr
Paderborn University
Tobias Gburrek
Paderborn University, Communications Engineering Department, Germany
Marc Deegen
Paderborn University, Communications Engineering Department, Germany
Reinhold Haeb-Umbach
Professor of Communications Engineering, University of Paderborn
automatic speech recognition, speech enhancement, statistical signal processing