🤖 AI Summary
This work addresses the inconsistent and often ad hoc evaluation of fairness in automatic speech recognition (ASR) systems across diverse speaker populations. To remedy this, the authors propose a responsible ASR fairness evaluation framework that systematically integrates principles from machine learning fairness, social science, and speech science. The framework emphasizes articulating explicit fairness assumptions, selecting context-appropriate metrics, and conducting fine-grained intersectional analyses based on demographic variables. By demonstrating how coarse-grained groupings can lead to misleading conclusions, the study exposes significant risks in current benchmarking practices and establishes a reproducible, rigorous foundation for future research on ASR fairness.
📝 Abstract
Many studies have shown that automatic speech recognition (ASR) systems perform unequally across speaker groups (SGs). However, the manner in which such studies arrive at this conclusion is inconsistent. To pave the way for more reliable results in future studies, we lay out best practices for benchmarking ASR fairness based on literature from machine learning fairness, social sciences, and speech science. We first describe the importance of precisely defining the fairness hypothesis being interrogated and of tailoring fairness metrics to that hypothesis. We then examine several benchmarks used to rate ASR systems on fairness and discuss how their results can be misconstrued without careful attention to the intersections between SGs. We find that evaluating fairness based on single heterogeneous SGs, as they are defined in fairness benchmarks, can lead to misidentifying which SGs are actually being mistreated by ASR systems. We advocate for as fine-grained an analysis as possible of the intersectionality of as many demographic variables as are available in the metadata of fairness corpora in order to tease out such spurious correlations.
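To make the point about heterogeneous speaker groups concrete, the following is a minimal sketch, not the authors' method. It assumes word error rate (WER) as the metric and uses invented toy counts. The field names (`gender`, `accent`, `errors`, `words`) and the helper `wer_by_group` are hypothetical. The toy data is built so that a single-variable comparison suggests a gender gap that disappears once accent is taken into account.

```python
from collections import defaultdict

# Synthetic toy counts, invented purely for illustration (not from any corpus).
# Each record holds the word errors and reference words for one subgroup.
RECORDS = [
    {"gender": "female", "accent": "A", "errors": 100, "words": 1000},
    {"gender": "female", "accent": "B", "errors": 900, "words": 3000},
    {"gender": "male", "accent": "A", "errors": 300, "words": 3000},
    {"gender": "male", "accent": "B", "errors": 300, "words": 1000},
]


def wer_by_group(records, keys):
    """Return aggregate WER (total errors / total reference words) per group.

    A group is the tuple of values of `keys`, so passing several keys
    yields an intersectional breakdown.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for record in records:
        group = tuple(record[key] for key in keys)
        errors[group] += record["errors"]
        words[group] += record["words"]
    return {group: errors[group] / words[group] for group in errors}


# Coarse view: one demographic variable at a time.
by_gender = wer_by_group(RECORDS, ["gender"])

# Fine-grained view: the intersection of both variables.
by_intersection = wer_by_group(RECORDS, ["gender", "accent"])

for group, wer in sorted(by_gender.items()):
    print(group, round(wer, 2))
for group, wer in sorted(by_intersection.items()):
    print(group, round(wer, 2))
```

In this toy data the gender-only view gives 25% WER for female speakers and 15% for male speakers, yet within each accent both genders have identical WER (10% for accent A, 30% for accent B). The apparent gender gap comes entirely from the uneven mix of accents in each gender group, which is the kind of spurious correlation a fine-grained intersectional analysis is meant to expose.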