Mitigating Non-Target Speaker Bias in Guided Speaker Embedding

📅 2025-06-14
🤖 AI Summary
Guided speaker embedding, which uses the speech activities of target and non-target speakers as clues, greatly improves embeddings under severe overlap but degrades slightly in low-overlap conditions, which are the common case in natural conversations. This paper traces the degradation to global-statistics-based modules, such as statistical pooling, that are overly sensitive to intervals containing only non-target speakers. To address this, the authors extend these modules to use the target-speaker activity, so that statistics are computed only from intervals where the target is active. Experiments on VoxCeleb and CALLHOME show improved speaker verification, measured by equal error rate (EER), at both low and high overlap ratios, along with improved diarization error rate (DER).

📝 Abstract
Obtaining high-quality speaker embeddings in multi-speaker conditions is crucial for many applications. A recently proposed guided speaker embedding framework, which utilizes the speech activities of target and non-target speakers as clues, drastically improved embeddings under severe overlap at the cost of a small degradation in low-overlap cases. However, since extreme overlaps are rare in natural conversations, this degradation cannot be overlooked. This paper first reveals that the degradation is caused by the global-statistics-based modules, widely used in speaker embedding extractors, being overly sensitive to intervals containing only non-target speakers. As a countermeasure, we propose an extension of such modules that exploits the target speaker activity clues to compute statistics only from intervals where the target is active. The proposed method improves speaker verification performance at both low and high overlap ratios, and diarization performance on multiple datasets.
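
To make the idea concrete, the following is a minimal NumPy sketch of statistics pooling weighted by target-speaker activity. It is an illustration under assumptions, not the authors' implementation: the function name `masked_stats_pooling`, the toy feature values, and the hard 0/1 activity mask are all hypothetical, and the paper's actual modules may differ in detail.

```python
import numpy as np


def masked_stats_pooling(frames, target_activity, eps=1e-8):
    """Weighted mean and standard deviation over time (hypothetical helper).

    frames: (T, D) array of frame-level features.
    target_activity: (T,) weights in [0, 1]; 1 where the target speaker is active.
    Returns a (2 * D,) vector: weighted mean followed by weighted std.
    """
    w = np.asarray(target_activity, dtype=float)[:, None]  # (T, 1)
    total = w.sum() + eps                                   # avoid division by zero
    mean = (w * frames).sum(axis=0) / total
    var = (w * (frames - mean) ** 2).sum(axis=0) / total
    return np.concatenate([mean, np.sqrt(var + eps)])


# Toy data (assumed): the target speaker occupies the first 60 frames with
# features near +1; a non-target speaker occupies the last 40 frames near -1.
rng = np.random.default_rng(0)
T, D = 100, 4
frames = np.concatenate([
    rng.normal(1.0, 0.1, size=(60, D)),
    rng.normal(-1.0, 0.1, size=(40, D)),
])
target_activity = np.concatenate([np.ones(60), np.zeros(40)])

# All-ones weights reproduce conventional global statistics pooling.
global_stats = masked_stats_pooling(frames, np.ones(T))
# Target-activity weights restrict the statistics to target-active frames.
guided_stats = masked_stats_pooling(frames, target_activity)

print("global mean:", global_stats[:D])
print("guided mean:", guided_stats[:D])
```

In this toy setting the global mean is pulled toward the non-target cluster and the global standard deviation is inflated, while the guided statistics stay close to the target speaker's own distribution. This is the kind of non-target bias the abstract describes for intervals containing only non-target speakers.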
Problem

Research questions and friction points this paper is trying to address.

Reducing bias from non-target speakers in embeddings
Improving speaker verification across overlap conditions
Enhancing diarization accuracy using target activity clues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Target speaker activity-guided statistics computation
Improved speaker verification at both low and high overlap ratios
Enhanced diarization across multiple datasets
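
One plausible way to write the activity-guided statistics is given below. The notation is an assumption introduced here for illustration: $h_t$ denotes the frame-level feature at time $t$, $a_t \in [0, 1]$ the target-speaker activity, and $T$ the number of frames. The paper's exact formulation may differ.

```latex
\mu = \frac{\sum_{t=1}^{T} a_t \, h_t}{\sum_{t=1}^{T} a_t},
\qquad
\sigma = \sqrt{\frac{\sum_{t=1}^{T} a_t \, (h_t - \mu)^{2}}{\sum_{t=1}^{T} a_t}}
```

The square and square root are applied element-wise. Setting $a_t = 1$ for all $t$ recovers conventional global statistics, in which frames containing only non-target speech contribute as much as target-active frames.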