🤖 AI Summary
A recently proposed guided speaker embedding framework, which uses the speech activities of target and non-target speakers as clues, greatly improves embeddings under heavy overlap but slightly degrades in low-overlap conditions; since extreme overlap is rare in natural conversation, this degradation matters in practice. This paper traces the degradation to the global-statistics-based pooling modules widely used in speaker embedding extractors, which are overly sensitive to intervals containing only non-target speakers. As a countermeasure, the authors extend these modules to exploit the target-speaker activity clues, computing statistics only over intervals where the target speaker is active, within an end-to-end trainable architecture. Experiments on VoxCeleb and CALLHOME show consistent gains: speaker verification equal error rate (EER) improves at both low and high overlap ratios, and diarization error rate (DER) improves across multiple datasets, with the largest gains in low-overlap scenarios.
📝 Abstract
Obtaining high-quality speaker embeddings in multi-speaker conditions is crucial for many applications. A recently proposed guided speaker embedding framework, which utilizes the speech activities of target and non-target speakers as clues, drastically improved embeddings under severe overlap, but with small degradation in low-overlap cases. Since extreme overlaps are rare in natural conversations, this degradation cannot be overlooked. This paper first reveals that the degradation is caused by the global-statistics-based modules widely used in speaker embedding extractors, which are overly sensitive to intervals containing only non-target speakers. As a countermeasure, we propose an extension of such modules that exploits the target speaker activity clues to compute statistics only from intervals where the target is active. The proposed method improves speaker verification performance at both low and high overlap ratios, and diarization performance on multiple datasets.
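The core idea of the abstract, replacing global statistical pooling with statistics computed only over frames where the target speaker is active, can be sketched with a simple masked pooling function. This is a minimal illustration, not the paper's implementation: the function name, the NumPy setting, and the use of a per-frame activity weight vector are assumptions for the sketch.

```python
import numpy as np

def target_masked_stats_pooling(features, target_activity, eps=1e-8):
    """Mean/std pooling restricted to target-active frames.

    features:        (T, D) frame-level features from the encoder.
    target_activity: (T,) binary or soft target-speaker activity weights
                     (the "clue" in the guided framework).
    Returns a (2*D,) vector of masked mean and masked std, so that
    frames containing only non-target speech do not pollute the statistics.
    """
    w = target_activity.astype(float)
    denom = w.sum() + eps  # eps guards against an all-zero mask
    mean = (w[:, None] * features).sum(axis=0) / denom
    var = (w[:, None] * (features - mean) ** 2).sum(axis=0) / denom
    std = np.sqrt(var + eps)
    return np.concatenate([mean, std])
```

With a hard binary mask this reduces to ordinary mean/std pooling over the target-active frames only; with soft activity estimates (e.g. from a jointly trained speaker activity detector) it becomes a weighted pooling that can be optimized end to end.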