Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling

📅 2025-08-08
🤖 AI Summary
Traditional speech separation and speaker diarization typically rely on target-speaker priors or a predefined speaker count, which limits their use in open-set scenarios. To address this, the paper proposes a joint separation and diarization framework that requires neither speaker enrollment nor an assumed number of speakers. The method automatically identifies target speaker embeddings within mixtures via augmented speaker embedding sampling. A two-stage training strategy learns speaker representations that are robust to background noise, and an overlapping spectral loss targets diarization accuracy on overlapped speech frames. In the reported experiments, the approach achieves a 71% relative improvement in diarization error rate (DER) and a 69% relative improvement in concatenated minimum-permutation word error rate (cpWER) over the current state-of-the-art baseline.

📝 Abstract
Traditional speech separation and speaker diarization approaches rely on prior knowledge of target speakers or a predetermined number of participants in audio signals. To address these limitations, recent advances focus on developing enrollment-free methods capable of identifying targets without explicit speaker labeling. This work introduces a new approach to jointly train speech separation and diarization using automatic identification of target speaker embeddings within mixtures. Our proposed model employs a dual-stage training pipeline designed to learn speaker representations that are robust to background noise. Furthermore, we present an overlapping spectral loss function specifically tailored to improve diarization accuracy on overlapped speech frames. Experimental results show significant performance gains over the current state-of-the-art (SOTA) baseline, with a 71% relative improvement in DER and 69% in cpWER.
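To make the "relative improvement" figures concrete, the following is a minimal sketch of how a relative reduction in an error metric such as DER or cpWER is usually computed. The function name and the numeric values are hypothetical illustrations chosen to land near 71%; they are not results from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Return the fraction by which an error rate drops relative to a baseline.

    Both arguments are error rates in the same unit (for example, percent DER).
    """
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - proposed) / baseline


# Hypothetical values for illustration only (not taken from the paper).
baseline_der = 20.0   # baseline system, percent DER
proposed_der = 5.8    # proposed system, percent DER

reduction = relative_reduction(baseline_der, proposed_der)
print(f"Relative DER reduction: {reduction:.0%}")
```

Note that a relative improvement says nothing about the absolute error rates by itself: the same 71% figure could describe a drop from 20.0 to 5.8 or from 10.0 to 2.9.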
Problem

Research questions and friction points this paper is trying to address.

Enrollment-free speaker diarization and separation
Robust speaker representation against noise
Improved accuracy in overlapped speech frames
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enrollment-free target speaker identification method
Dual-stage training for robust speaker features
Overlapping spectral loss enhances diarization accuracy
Md Asif Jalal
Machine Learning researcher
Machine Learning, ASR, Speech Processing, Affective Computing, Generative AI
Luca Remaggi
Samsung R&D Institute UK (SRUK), United Kingdom
Vasileios Moschopoulos
Centre for Research and Technology Hellas, Greece
Thanasis Kotsiopoulos
Centre for Research and Technology Hellas, Greece
Vandana Rajan
Samsung R&D Institute UK (SRUK), United Kingdom
Karthikeyan Saravanan
Samsung R&D Institute UK (SRUK), United Kingdom
Anastasis Drosou
Centre for Research and Technology Hellas, Greece
Junho Heo
Language AI R&D Group (MX), Samsung Electronics, South Korea
Hyuk Oh
Language AI R&D Group (MX), Samsung Electronics, South Korea
Seokyeong Jeong
Language AI R&D Group (MX), Samsung Electronics, South Korea