🤖 AI Summary
This work addresses speaker diarization (speech activity detection plus speaker attribution) in meeting scenarios where multi-channel training data are unavailable and the number and placement of microphones are unknown. We propose a pipeline that couples TDOA-driven speech segmentation with acoustic-embedding clustering for speaker attribution. The method requires no prior knowledge about the microphones and supports both compact arrays and distributed microphone setups. To handle overlapping speech and speaker movement, it combines spatial and spectral cues: segmentation is guided by TDOA estimation, while spectral clustering of speaker embeddings keeps speaker identities consistent across positions. Experiments show that the approach significantly outperforms the single-channel pyannote baseline under both microphone configurations, with clear gains in overlap-aware segmentation accuracy and speaker-identity stability.
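As a rough illustration of the spatial cue that drives segmentation, the sketch below estimates a time difference of arrival between two microphone channels with GCC-PHAT cross-correlation. This is a generic, minimal example, not the authors' implementation; the function name, parameters, and PHAT weighting are assumptions for illustration.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None, interp=1):
    """Estimate the TDOA (in seconds) of signal y relative to x
    using GCC-PHAT (generalized cross-correlation, phase transform)."""
    n = x.shape[0] + y.shape[0]
    # Cross-power spectrum with PHAT weighting (unit magnitude per bin).
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12
    # Back to the time domain: peak position gives the relative delay.
    cc = np.fft.irfft(R, n=n * interp)
    max_shift = n * interp // 2
    if max_tau is not None:
        # Restrict the search to physically plausible delays.
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(interp * fs)

# Hypothetical usage: tau = gcc_phat(mic1_frame, mic2_frame, fs=16000, max_tau=0.01)
```

In a diarization front end, such per-frame TDOA estimates can be tracked over time so that changes in the dominant delay pattern mark speaker-turn boundaries, including during overlap.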
📝 Abstract
We propose a spatio-spectral diarization pipeline that combines model-based and data-driven components: TDOA-based segmentation followed by embedding-based clustering. The proposed system requires neither access to multi-channel training data nor prior knowledge about the number or placement of microphones. It works for both a compact microphone array and distributed microphones, with minor adjustments. Due to its superior handling of overlapping speech during segmentation, the proposed pipeline significantly outperforms the single-channel pyannote approach, both in a scenario with a compact microphone array and in a setup with distributed microphones. Additionally, we show that, unlike fully spatial diarization pipelines, the proposed system can correctly track speakers when they change positions.
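To make the second stage concrete, here is a minimal sketch of attributing speaker identities to TDOA-derived segments by spectral clustering of their speaker embeddings (e.g. x-vector or ECAPA-style vectors). The use of scikit-learn, a cosine-similarity affinity, and a known speaker count are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_segments(embeddings, n_speakers):
    """Assign a speaker label to each TDOA-derived segment.

    embeddings : (n_segments, dim) array, one speaker embedding per segment.
    n_speakers : assumed number of speakers in the meeting.
    Returns an array of integer speaker labels, one per segment.
    """
    # Cosine-similarity affinity between L2-normalized segment embeddings.
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(norm @ norm.T, 0.0, 1.0)
    labels = SpectralClustering(
        n_clusters=n_speakers,
        affinity="precomputed",
        assign_labels="kmeans",
        random_state=0,
    ).fit_predict(affinity)
    return labels

# Hypothetical usage: labels = cluster_segments(segment_embeddings, n_speakers=4)
```

Because the labels come from spectral rather than spatial similarity, a speaker who moves to a new position still maps to the same cluster, which is what allows the pipeline to track speakers across position changes.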