🤖 AI Summary
This work addresses speaker verification on conversational telephone speech (CTS) for the audio track of NIST SRE 2024. The ABC team develops speaker embedding extractors (frontends) under both NIST training conditions: fixed (training only on the provided set of telephone recordings) and open (adding publicly available data). The explored frontends combine ResNet and ReDimNet backbones with statistics and attention-based pooling, and also include a system built on XLS-R, a large pre-trained self-supervised model. In the open condition, models are trained on VoxBlink2, a multilingual dataset with 110 thousand speakers. The VoxBlink2-trained models show good performance and robustness on CTS, and the experiments yield practical recipes for developing state-of-the-art speaker recognition frontends.
📝 Abstract
We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explore architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, and a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, which contains 110 thousand speakers across multiple languages. We observe good performance and robustness from the VoxBlink2-trained models, and our experiments provide practical recipes for developing state-of-the-art frontends for speaker recognition.
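As a rough illustration of what a "pooling mechanism" does in such a frontend, the sketch below collapses a variable-length sequence of frame-level features into one fixed-size utterance-level vector. It is not the authors' implementation. The two variants shown (plain statistics pooling and a simple attention-weighted version) are assumed as common examples, and the function names and the scoring vector `w` are hypothetical; in a real system `w` would be learned and the frames would come from a ResNet or similar backbone.

```python
import math

def statistics_pooling(frames):
    """Concatenate the per-dimension mean and standard deviation over time.

    frames: list of T frame vectors, each a list of D floats.
    Returns a list of 2*D floats.
    """
    num_frames = len(frames)
    dims = list(zip(*frames))  # D tuples, each holding T values
    mean = [sum(d) / num_frames for d in dims]
    std = [math.sqrt(sum((x - m) ** 2 for x in d) / num_frames)
           for d, m in zip(dims, mean)]
    return mean + std

def attentive_statistics_pooling(frames, w):
    """Attention-weighted mean and standard deviation over time.

    w: hypothetical scoring vector of length D (learned in a real system).
    Each frame gets a scalar score; a softmax turns scores into weights.
    """
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    dims = list(zip(*frames))
    mean = [sum(a * x for a, x in zip(alpha, d)) for d in dims]
    std = [math.sqrt(max(0.0, sum(a * (x - m) ** 2 for a, x in zip(alpha, d))))
           for d, m in zip(dims, mean)]
    return mean + std

# Two frames of two-dimensional features, purely illustrative values.
frames = [[1.0, 2.0], [3.0, 6.0]]
pooled = statistics_pooling(frames)
uniform = attentive_statistics_pooling(frames, [0.0, 0.0])
focused = attentive_statistics_pooling(frames, [10.0, 0.0])
```

With an all-zero `w` the attention weights are uniform, so the attentive variant reduces to plain statistics pooling. A non-zero `w` shifts the pooled statistics toward the frames that score highest.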