Analysis of ABC Frontend Audio Systems for the NIST-SRE24

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses speaker verification for conversational telephone speech (CTS), the predominant domain in the NIST SRE 2024 audio track. The ABC team develops speaker embedding extractors (frontends) under the two training conditions imposed by NIST: fixed (training only on a provided set of telephone recordings) and open (adding publicly available data). The explored architectures are ResNet with different pooling mechanisms, the recently introduced ReDimNet, and a system built on XLS-R, a large pre-trained self-supervised model. In the open condition, the models are trained on VoxBlink2, a multilingual dataset with 110 thousand speakers. The VoxBlink2-trained models show good performance and robustness, and the experiments yield practical recipes for building state-of-the-art speaker recognition frontends.

📝 Abstract

We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explore architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, and a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, which contains 110 thousand speakers across multiple languages. We observe good performance and robustness of the VoxBlink2-trained models, and our experiments show practical recipes for developing state-of-the-art frontends for speaker recognition.

Problem

Research questions and friction points this paper is trying to address.

Develop optimal speaker embedding extractors for telephone speech

Compare architectures under fixed and open training conditions

Evaluate performance of models trained on multilingual VoxBlink2 dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

ResNet with diverse pooling mechanisms

ReDimNet architecture exploration

XLS-R large pre-trained model
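
To illustrate what a "pooling mechanism" does in an embedding extractor, the sketch below collapses a variable-length sequence of frame-level features into one fixed-size vector, first with plain statistics pooling and then with an attention-weighted variant. It is a minimal NumPy illustration under stated assumptions, not the ABC team's implementation: the function names, the single linear scoring vector `w`, and the feature dimensions are all hypothetical, and the page does not say which pooling variants were actually used.

```python
import numpy as np


def statistics_pooling(frames: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Map (num_frames, feat_dim) features to a (2 * feat_dim,) vector.

    The output concatenates the per-dimension mean and standard deviation
    computed over the time axis.
    """
    mean = frames.mean(axis=0)
    std = np.sqrt(frames.var(axis=0) + eps)
    return np.concatenate([mean, std])


def attentive_statistics_pooling(
    frames: np.ndarray, w: np.ndarray, eps: float = 1e-8
) -> np.ndarray:
    """Attention-weighted mean and standard deviation over frames.

    `w` is a hypothetical (feat_dim,) scoring vector; real systems usually
    learn a small network to produce the per-frame scores.
    """
    scores = frames @ w                      # one score per frame
    scores = scores - scores.max()           # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()        # softmax over the time axis
    mean = weights @ frames                  # weighted mean, shape (feat_dim,)
    var = weights @ (frames - mean) ** 2     # weighted variance
    std = np.sqrt(np.maximum(var, 0.0) + eps)
    return np.concatenate([mean, std])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(50, 4))        # 50 frames, 4-dim features (toy sizes)
    w = rng.normal(size=4)
    print(statistics_pooling(frames).shape)
    print(attentive_statistics_pooling(frames, w).shape)
```

With an all-zero `w` the attention weights become uniform, so the attentive variant reduces to plain statistics pooling; a non-zero `w` lets some frames contribute more than others to the utterance-level vector.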
