Analysis of ABC Frontend Audio Systems for the NIST-SRE24

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses speaker verification for conversational telephone speech (CTS) in the NIST SRE 2024 audio track. The authors develop robust speaker embedding extractors under both the fixed condition (trained exclusively on the provided SRE24 telephone data) and the open condition (additionally incorporating multilingual public data). Methodologically, they integrate the ReDimNet architecture and the XLS-R self-supervised model into the SRE frontend, and apply cross-lingual transfer learning based on VoxBlink2 to improve generalization on low-resource telephone speech. Their experiments combine ResNet/ReDimNet backbones, statistical and attention-based pooling, XLS-R pretraining, and VoxBlink2 fine-tuning. On SRE24, this framework yields strong verification accuracy and cross-channel robustness, and the resulting recipes form a reusable frontend development framework for speaker verification.
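The attention-based pooling mentioned in the summary can be illustrated with attentive statistics pooling, which replaces a plain temporal mean/std with attention-weighted statistics. A minimal NumPy sketch (the function name, dimensions, and random weights are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def attentive_stats_pooling(frames, w, b, v):
    """Attention-weighted mean and std over time.

    frames: (T, D) frame-level features
    w: (D, H) attention projection, b: (H,) bias, v: (H,) scoring vector
    Returns a (2*D,) utterance-level vector [weighted mean; weighted std].
    """
    h = np.tanh(frames @ w + b)            # (T, H) hidden attention layer
    scores = h @ v                          # (T,) one score per frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                    # softmax over time, weights sum to 1
    mu = (alpha[:, None] * frames).sum(axis=0)                # weighted mean
    var = (alpha[:, None] * (frames - mu) ** 2).sum(axis=0)   # weighted variance
    return np.concatenate([mu, np.sqrt(var + 1e-9)])

# Toy example: 200 frames of 80-dim features pooled to a 160-dim vector
rng = np.random.default_rng(0)
T, D, H = 200, 80, 64
x = rng.standard_normal((T, D))
emb = attentive_stats_pooling(x, rng.standard_normal((D, H)) * 0.1,
                              np.zeros(H), rng.standard_normal(H))
print(emb.shape)  # (160,)
```

In a trained system the pooled vector would feed a linear embedding layer; here the weights are random purely to show the tensor shapes.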

📝 Abstract
We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, as well as a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, containing 110 thousand speakers across multiple languages. We observed good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-the-art frontends for speaker recognition.
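Verification with such frontends typically scores an enrollment/test embedding pair by cosine similarity (often after further backend processing such as PLDA, which this listing does not detail). A minimal illustrative sketch with synthetic embeddings, not the paper's actual scoring backend:

```python
import numpy as np

def cosine_score(e1, e2):
    # Length-normalize both embeddings and take the dot product;
    # a higher score means "more likely the same speaker".
    e1 = e1 / np.linalg.norm(e1)
    e2 = e2 / np.linalg.norm(e2)
    return float(e1 @ e2)

# Toy trial: two noisy views of one "speaker" vs. an unrelated vector
rng = np.random.default_rng(1)
spk_a = rng.standard_normal(256)
enroll = spk_a + 0.1 * rng.standard_normal(256)
test_same = spk_a + 0.1 * rng.standard_normal(256)
test_diff = rng.standard_normal(256)
print(cosine_score(enroll, test_same), cosine_score(enroll, test_diff))
```

The same-speaker pair scores close to 1, while the unrelated pair scores near 0, which is the separation a good frontend must produce on real trials.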
Problem

Research questions and friction points this paper is trying to address.

Develop optimal speaker embedding extractors for telephone speech
Compare architectures under fixed and open training conditions
Evaluate performance of models trained on multilingual VoxBlink2 dataset
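Frontend comparisons like the ones listed above are usually reported via the equal error rate (EER), the operating point where the false-accept and false-reject rates coincide. A small self-contained sketch of computing EER from trial scores (illustrative only, not NIST's official scoring tool):

```python
import numpy as np

def equal_error_rate(tar, non):
    """EER from target (same-speaker) and non-target trial scores."""
    thresholds = np.sort(np.concatenate([tar, non]))
    far = np.array([(non >= t).mean() for t in thresholds])  # false-accept rate
    frr = np.array([(tar < t).mean() for t in thresholds])   # false-reject rate
    i = np.argmin(np.abs(far - frr))  # threshold where the two rates cross
    return (far[i] + frr[i]) / 2

# Toy scores: targets mostly above non-targets, with some overlap
tar = np.array([2.0, 1.5, 1.2, 0.9])
non = np.array([0.1, 0.4, 0.6, 1.0])
print(equal_error_rate(tar, non))  # → 0.25
```

Real evaluations sweep far denser score sets and interpolate the crossover, but the principle is the same.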
Innovation

Methods, ideas, or system contributions that make the work stand out.

ResNet with diverse pooling mechanisms
ReDimNet architecture exploration
XLS-R large pre-trained model
Sara Barahona
AUDIAS, Universidad Autónoma de Madrid, Spain

Anna Silnova
Brno University of Technology
Speaker and language recognition, machine learning

Ladislav Mošner
Brno University of Technology, Czechia

Junyi Peng
Brno University of Technology, Czechia

Oldřich Plchot
Researcher, Brno University of Technology
Pattern recognition, speech processing, computer networks

Johan Rohdin
Brno University of Technology
Speech processing, machine learning

Lin Zhang
Brno University of Technology, Czechia

Jiangyu Han
Brno University of Technology, Czechia

Petr Palka
Brno University of Technology, Czechia

Federico Landini
Brno University of Technology

Lukáš Burget
Brno University of Technology, Czechia

Themos Stafylakis
Assoc. Prof. at Athens Univ. of Economics and Business | Omilia | Archimedes/Athena R.C.
Voice Biometrics, Speaker Recognition, Audiovisual ASR, NLP, Machine Learning

Sandro Cumani
Politecnico di Torino, Italy

Dominik Boboš
Phonexia, Czechia

Miroslav Hlaváček
Phonexia, Czechia

Martin Kodovský
Phonexia, Czechia

Tomáš Pavlíček
Phonexia, Czechia