🤖 AI Summary
Current speaker verification (SV) models face multiple robustness challenges in real-world scenarios, including insufficient utterance duration, noise, channel and codec mismatch, cross-lingual variation, age disparity, and adversarial attacks, yet existing benchmarks provide only narrow, incomplete evaluations. To address this, we propose the first systematic SV robustness benchmark, uniquely integrating critical real-world factors such as cross-lingual and cross-age variability and codec-induced distortions. The benchmark evaluates SV models across four dimensions (acoustic degradation, environmental variation, spoofing attacks, and adversarial perturbations), combining both synthetic and authentic stress conditions. Extensive experiments reveal significant performance degradation of state-of-the-art SV models under cross-lingual, cross-age, and compressed-audio conditions, while uncovering systematic robustness biases across gender, age, and language groups. This benchmark establishes a more comprehensive, fair, and practically relevant evaluation paradigm for SV systems.
📝 Abstract
Speaker verification (SV) models are increasingly integrated into security, personalization, and access-control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These challenges include a variety of natural and maliciously created conditions that cause signal degradation or mismatches between enrollment and test data, hurting performance. Existing benchmarks evaluate only subsets of these conditions and miss others entirely. We introduce SVeritas, a comprehensive benchmark suite for speaker verification tasks that assesses SV systems under stressors such as recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatch, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks. While several existing benchmarks each cover some of these conditions, SVeritas is the first comprehensive evaluation that includes all of them, along with several other entirely new, but nonetheless important, real-life conditions that have not previously been benchmarked. Using SVeritas to evaluate several state-of-the-art SV models, we observe that while some architectures maintain stability under common distortions, they suffer substantial performance degradation in scenarios involving cross-language trials, age mismatches, and codec-induced compression. Extending the analysis across demographic subgroups, we further identify disparities in robustness across age groups, genders, and linguistic backgrounds. By standardizing evaluation under realistic and synthetic stress conditions, SVeritas enables precise diagnosis of model weaknesses and establishes a foundation for advancing equitable and reliable speaker verification systems.
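Benchmarks like the one described typically score enrollment/test trial pairs (e.g., by cosine similarity of speaker embeddings) and report an equal error rate (EER) per stress condition. As a minimal illustrative sketch — not the paper's actual pipeline; the embedding inputs and trial labels here are assumed — the scoring and EER computation might look like:

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings (higher = more likely same speaker)."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: operating point where false-accept rate (FAR) equals false-reject rate (FRR).

    scores: one similarity score per trial; labels: 1 = target (same speaker), 0 = impostor.
    Sweeps all candidate thresholds and returns (FAR + FRR) / 2 where |FAR - FRR| is smallest.
    """
    best = (1.0, 1.0)  # (|FAR - FRR|, candidate EER)
    for t in np.sort(np.unique(scores)):
        far = float(np.mean(scores[labels == 0] >= t))  # impostors accepted
        frr = float(np.mean(scores[labels == 1] < t))   # targets rejected
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

# Toy trial list: two target trials, two impostor trials (hypothetical scores).
scores = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
print(equal_error_rate(scores, labels))  # perfectly separable scores → EER 0.0
```

In a benchmark setting, the same trial list would be re-scored under each stressor (noise, codec compression, cross-language enrollment, etc.), and the per-condition EERs compared to diagnose where a model degrades.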