🤖 AI Summary
A recently proposed guided speaker embedding framework, which uses the speech activities of target and non-target speakers as clues, greatly improves embeddings under heavy overlap but slightly degrades in low-overlap conditions; since extreme overlap is rare in natural conversation, this degradation matters in practice. This paper traces the degradation to the global-statistics-based pooling modules widely used in speaker embedding extractors, which are overly sensitive to intervals containing only non-target speakers. As a countermeasure, the authors extend these modules to exploit the target-speaker activity clues, computing statistics only over intervals where the target speaker is active, within an end-to-end trainable architecture. Experiments on VoxCeleb and CALLHOME show consistent gains: speaker verification equal error rate (EER) improves at both low and high overlap ratios, and diarization error rate (DER) improves across multiple datasets, with the largest gains in low-overlap scenarios.
📝 Abstract
Obtaining high-quality speaker embeddings in multi-speaker conditions is crucial for many applications. A recently proposed guided speaker embedding framework, which utilizes the speech activities of target and non-target speakers as clues, drastically improved embeddings under severe overlap, but with small degradation in low-overlap cases. Since extreme overlaps are rare in natural conversations, this degradation cannot be overlooked. This paper first reveals that the degradation is caused by the global-statistics-based modules widely used in speaker embedding extractors, which are overly sensitive to intervals containing only non-target speakers. As a countermeasure, we propose an extension of such modules that exploits the target speaker activity clues to compute statistics only from intervals where the target is active. The proposed method improves speaker verification performance at both low and high overlap ratios, and diarization performance on multiple datasets.
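The core idea of the abstract, replacing global statistical pooling with statistics computed only over frames where the target speaker is active, can be sketched with a simple masked pooling function. This is a minimal illustration, not the paper's implementation: the function name, the NumPy setting, and the use of a per-frame activity weight vector are assumptions for the sketch.

```python
import numpy as np

def target_masked_stats_pooling(features, target_activity, eps=1e-8):
    """Mean/std pooling restricted to target-active frames.

    features:        (T, D) frame-level features from the encoder.
    target_activity: (T,) binary or soft target-speaker activity weights
                     (the "clue" in the guided framework).
    Returns a (2*D,) vector of masked mean and masked std, so that
    frames containing only non-target speech do not pollute the statistics.
    """
    w = target_activity.astype(float)
    denom = w.sum() + eps  # eps guards against an all-zero mask
    mean = (w[:, None] * features).sum(axis=0) / denom
    var = (w[:, None] * (features - mean) ** 2).sum(axis=0) / denom
    std = np.sqrt(var + eps)
    return np.concatenate([mean, std])
```

With a hard binary mask this reduces to ordinary mean/std pooling over the target-active frames only; with soft activity estimates (e.g. from a jointly trained speaker activity detector) it becomes a weighted pooling that can be optimized end to end.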