🤖 AI Summary
This work addresses performance disparities in wake-word detection systems across demographic groups defined by gender, age, and accent. To mitigate these biases without requiring sensitive demographic labels during training—thus preserving user privacy—the authors propose a debiased training paradigm that integrates data augmentation and knowledge distillation from pretrained speech models. The approach is trained in a label-free manner on the OK Aura dataset, with demographic information used only during evaluation. Experimental results demonstrate that, compared to the baseline, the proposed method reduces prediction disparities by 39.94% for gender, 83.65% for age, and 40.48% for accent, substantially improving cross-group generalization and model fairness.
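The summary above does not define the Predictive Disparity metric itself; assuming the reported figures are relative reductions against the baseline's disparity value, the arithmetic behind a number like "39.94%" can be sketched as:

```python
def disparity_reduction(baseline_disparity, method_disparity):
    """Relative reduction of a disparity metric, in percent.

    NOTE: the paper's Predictive Disparity metric is not defined in this
    summary; this helper only illustrates the assumed relative-reduction
    computation behind figures such as "39.94% for gender".
    """
    return 100.0 * (baseline_disparity - method_disparity) / baseline_disparity
```

For example, a method that lowers a group's disparity from 1.0 to 0.6006 yields a 39.94% reduction under this reading.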
📝 Abstract
Voice-based interfaces are widely used; however, achieving fair Wake-up Word detection across diverse speaker populations remains a critical challenge due to persistent demographic biases. This study evaluates the effectiveness of demographics-agnostic training techniques in mitigating performance disparities among speakers of varying sex, age, and accent. We utilize the OK Aura database for our experiments, employing a training methodology that excludes demographic labels, which are reserved for evaluation purposes. We explore (i) data augmentation techniques to enhance model generalization and (ii) knowledge distillation of pre-trained foundational speech models. The experimental results indicate that these demographics-agnostic training techniques markedly reduce demographic bias, leading to a more equitable performance profile across different speaker groups. Specifically, one of the evaluated techniques achieves a Predictive Disparity reduction of 39.94% for sex, 83.65% for age, and 40.48% for accent when compared to the baseline. This study highlights the effectiveness of label-agnostic methodologies in fostering fairness in Wake-up Word detection.
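The abstract names knowledge distillation from pretrained speech models as one of the label-free techniques, but does not spell out the objective. A common formulation, sketched here with assumed details (temperature scaling, KL divergence between softened distributions), trains the student from the teacher's outputs alone, so no demographic or class labels are needed:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from teacher's soft targets to student's predictions.

    ASSUMPTION: this is a generic distillation loss, not the paper's exact
    objective. The pretrained teacher's outputs are the only training
    signal, consistent with the label-free setup described in the abstract.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    # T^2 scaling keeps gradient magnitudes comparable across temperatures
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q)
    )
```

The loss is zero when the student matches the teacher exactly and positive otherwise, so minimizing it pulls the wake-word model toward the pretrained model's (presumably less biased) decision behavior.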