CL-UZH submission to the NIST SRE 2024 Speaker Recognition Evaluation

📅 2025-10-01
🤖 AI Summary
This work addresses the NIST SRE 2024 speaker recognition evaluation, conducting audio-only and audio-visual speaker recognition under both closed-set and open-set conditions. Methodologically, we build a Kaldi-based x-vector pipeline, integrating pre-trained models from VoxCeleb2 and VoxBlink2, and propose a novel multi-stage fine-tuning strategy across datasets—including the CTS superset. Crucially, we conduct the first systematic assessment of the visual modality’s independent contribution to audio-visual speaker recognition, empirically validating its robustness gains in low-SNR and short-utterance scenarios. Experimental results demonstrate that our approach achieves competitive performance across all SRE24 subtasks; notably, it significantly outperforms the baseline under open-set audio-visual conditions. These findings confirm the efficacy of multi-source pre-training and modality-decoupled analysis for robust speaker recognition.

📝 Abstract
The CL-UZH team submitted one system each for the fixed and open conditions of the NIST SRE 2024 challenge. For the closed-set condition, results on the audio-only trials were obtained with an x-vector system developed in Kaldi. For the audio-visual results we used only models developed for the visual modality. Two sets of results were submitted for the open-set and closed-set conditions: one based on a model pretrained on the VoxBlink2 and VoxCeleb2 datasets, and, for the closed set, an x-vector model trained from scratch on the CTS superset dataset. In addition to submitting the SRE24 results to the competition website, we discuss the performance of the proposed systems on the SRE24 evaluation in this report.
Problem

Research questions and friction points this paper is trying to address.

Developing speaker recognition systems for NIST SRE 2024 evaluation
Comparing audio-only and visual modality models for verification
Training models using VoxBlink2, VoxCeleb2 and CTS datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

X-vector system developed with Kaldi
Models trained for visual modality
Pretrained model using VoxBlink2 and VoxCeleb2 datasets
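Systems like the ones listed above reduce each utterance (or face track) to a fixed-dimensional embedding such as an x-vector, and a verification trial is decided by comparing the enrollment and test embeddings. The report does not specify the scoring backend, so the cosine-similarity sketch below is only illustrative; the function names and the decision threshold are assumptions, not part of the submission.

```python
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (e.g. x-vectors).

    Both vectors are length-normalized first, so the score lies in [-1, 1].
    """
    enroll = enroll_emb / np.linalg.norm(enroll_emb)
    test = test_emb / np.linalg.norm(test_emb)
    return float(np.dot(enroll, test))

def verify(enroll_emb: np.ndarray, test_emb: np.ndarray,
           threshold: float = 0.5) -> bool:
    """Accept the trial as a target (same speaker) if the score
    exceeds a threshold; the threshold here is a placeholder, since
    in practice it is calibrated on a development set."""
    return cosine_score(enroll_emb, test_emb) >= threshold
```

In real pipelines the raw cosine score is often preceded by embedding mean subtraction and LDA/PLDA, and followed by score calibration; this sketch keeps only the comparison step.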