🤖 AI Summary
Existing noise-robust speaker verification methods rely on implicit noise suppression, making it difficult to explicitly disentangle noise from speaker-specific features—thereby limiting robustness improvements. To address this, we propose a parallel joint-learning framework that introduces, for the first time, collaboratively operating noise extraction and speech enhancement networks. Built upon a dual U-Net architecture, our approach enables end-to-end joint optimization of noise modeling and speech purification, simultaneously enhancing speech quality and explicitly preserving discriminative speaker representations. Evaluated on standard noisy benchmark datasets, our method achieves significant performance gains: the equal error rate (EER) improves by 8.4% relatively over the previous state-of-the-art. This advancement not only enhances verification accuracy under noisy conditions but also improves model interpretability through explicit noise-speaker feature separation.
📝 Abstract
Noise-robust speaker verification leverages joint learning of speech enhancement (SE) and speaker verification (SV) to improve robustness. However, prevailing approaches rely on implicit noise suppression, which struggles to separate noise from speaker characteristics as they do not explicitly distinguish noise from speech during training. Although integrating SE and SV helps, it remains limited in handling noise effectively. Meanwhile, recent SE studies suggest that explicitly modeling noise, rather than merely suppressing it, enhances noise resilience. Reflecting this, we propose ParaNoise-SV, with dual U-Nets combining a noise extraction (NE) network and a speech enhancement (SE) network. The NE U-Net explicitly models noise, while the SE U-Net refines speech with guidance from NE through parallel connections, preserving speaker-relevant features. Experimental results show that ParaNoise-SV achieves a relatively 8.4% lower equal error rate (EER) than previous joint SE-SV models.