🤖 AI Summary
This work addresses the vulnerability of automatic speaker verification (ASV) systems to spoofing attacks in real-world conditions by proposing a spoofing-aware speaker verification (SASV) framework in which a spoofing detector and a speaker verification network operate in tandem. The spoofing detector pairs a self-supervised speech embedding frontend with a graph neural network backend, and a top-3-layer mixture-of-experts (MoE) mechanism fuses high- and low-level features to detect spoofed utterances. Speaker verification uses a lightweight multi-scale convolutional network that fuses 2D and 1D features, and a contrastive circle loss adaptively weights positive and negative sample pairs during training. Evaluated on the SASV track of the WildSpoof Challenge, the system is reported to improve robustness to spoofing and speaker verification accuracy.
📝 Abstract
This paper presents the DFKI-Speech system developed for the Spoofing-aware Automatic Speaker Verification (SASV) track of the WildSpoof Challenge. We propose a robust SASV framework in which a spoofing detector and a speaker verification (SV) network operate in tandem. The spoofing detector employs a self-supervised speech embedding extractor as the frontend, combined with a state-of-the-art graph neural network backend. A top-3-layer-based mixture-of-experts (MoE) fuses high-level and low-level features for effective detection of spoofed utterances. For speaker verification, we adapt a low-complexity convolutional neural network that fuses 2D and 1D features at multiple scales and is trained with the SphereFace loss. A contrastive circle loss is also applied to adaptively weight positive and negative pairs within each training batch, enabling the network to better distinguish between hard and easy sample pairs. Finally, AS-Norm score normalization with a fixed impostor cohort and model ensembling are used to further enhance the discriminative capability of the speaker verification system.
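To make the score normalization step concrete, the following is a minimal sketch of adaptive symmetric normalization (AS-Norm) against a fixed impostor cohort. It is not the DFKI-Speech implementation: the cosine scoring, the `top_k` value, the function names, and the toy 2-D embeddings are all assumptions chosen for illustration. Each raw trial score is standardized twice, once with the statistics of the enrollment embedding's highest cohort scores and once with those of the test embedding, and the two results are averaged.

```python
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def top_k_stats(embedding, cohort, top_k):
    """Mean and std of the top_k highest cohort scores for one embedding."""
    scores = sorted((cosine(embedding, c) for c in cohort), reverse=True)[:top_k]
    # A small floor keeps the division stable if the selected scores are identical.
    return mean(scores), max(pstdev(scores), 1e-8)


def as_norm(enroll, test, cohort, top_k=2):
    """Adaptive symmetric normalization of one enrollment/test trial score."""
    raw = cosine(enroll, test)
    mu_e, sd_e = top_k_stats(enroll, cohort, top_k)
    mu_t, sd_t = top_k_stats(test, cohort, top_k)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)


# Toy 2-D embeddings, invented for illustration only (hypothetical data).
cohort = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]]
enroll = [0.8, 0.6]
target_test = [0.7, 0.7]      # close to the enrollment embedding
impostor_test = [-0.6, 0.8]   # orthogonal to the enrollment embedding

target_score = as_norm(enroll, target_test, cohort)
impostor_score = as_norm(enroll, impostor_test, cohort)
```

In practice the cohort would hold embeddings from many impostor speakers and `top_k` would be much larger than in this toy setting. Keeping the cohort fixed means every trial is normalized against the same reference set.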