🤖 AI Summary
This study reveals a significant fairness bias in English speaker verification (SV) systems—not in discriminative performance, but in calibration quality—particularly for low-resource accents. To address this, we construct the first multi-accent fairness benchmark dataset based on VoxCeleb and propose a conditional-aware backend (DCAB) coupled with a data-balancing calibration framework, enabling the first accent-conditioned calibration optimization. Experiments show that our method reduces expected calibration error (ECE) by up to 62% on low-resource accent groups; after balanced training, multiple state-of-the-art SV systems exhibit over 50% reduction in cross-accent ECE variance, markedly improving calibration fairness. Our core contribution is establishing calibration bias as a critical dimension of SV fairness and introducing a scalable, condition-aware calibration paradigm.
📝 Abstract
Speaker verification (SV) systems are currently being used to make sensitive decisions like giving access to bank accounts or deciding whether the voice of a suspect coincides with that of the perpetrator of a crime. Ensuring that these systems are fair and do not disfavor any particular group is crucial. In this work, we analyze the performance of several state-of-the-art SV systems across groups defined by the accent of the speakers when speaking English. To this end, we curated a new dataset based on the VoxCeleb corpus where we carefully selected samples from speakers with accents from different countries. We use this dataset to evaluate system performance for several SV systems trained with VoxCeleb data. We show that, while discrimination performance is reasonably robust across accent groups, calibration performance degrades dramatically on some accents that are not well represented in the training data. Finally, we show that a simple data balancing approach mitigates this undesirable bias, being particularly effective when applied to our recently-proposed discriminative condition-aware backend.