🤖 AI Summary
This paper addresses two challenges that automatic speech recognition (ASR) systems face in real-world deployment. The first is societal fairness: gender, accent, and age biases and their propagation to downstream tasks. The second is environmental sustainability: the energy consumption and carbon emissions of large-model inference. We propose the first unified evaluation framework that integrates fairness auditing with carbon footprint modeling, using real-world speech data to quantify multidimensional biases, analyze their downstream impact, and empirically measure inference energy consumption and the associated CO₂e emissions. Experiments reveal accent- and age-related word error rate (WER) disparities of up to 42%, and a single Whisper transcription emits over 0.5 kg CO₂e. Our contributions are: (1) a reproducible dual-dimension benchmark; (2) the first empirical evidence linking bias patterns with high carbon intensity; and (3) a novel “fair–green” co-design paradigm for ASR evaluation.
📝 Abstract
In this paper, we present a bias- and sustainability-focused investigation of Automatic Speech Recognition (ASR) systems, namely Whisper and Massively Multilingual Speech (MMS), which have achieved state-of-the-art (SOTA) performance. Despite their improved performance in controlled settings, there remains a critical gap in understanding their efficacy and equity in real-world scenarios. We analyze ASR biases with respect to gender, accent, and age group, as well as their effect on downstream tasks. In addition, we examine the environmental impact of ASR systems, scrutinizing the effect of large acoustic models on carbon emissions and energy consumption. We also provide insights from our empirical analyses, adding empirical grounding to the claims surrounding bias and sustainability in ASR systems.
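The two quantities discussed above, per-group WER and inference emissions, can be illustrated with a minimal Python sketch. This is not the paper's evaluation code. The function names (`wer_by_group`, `wer_gap`, `co2e_kg`), the choice of an absolute best-to-worst WER gap as the disparity measure, and the grid carbon intensity passed to `co2e_kg` are all assumptions made for illustration; the paper may define disparity and measure energy differently.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # reference word deleted
                            curr[j - 1] + 1,       # hypothesis word inserted
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Corpus-level WER per group.

    samples: iterable of (group, reference, hypothesis) strings.
    WER = total word edits / total reference words within each group.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors[group] += edit_distance(ref_tokens, hyp_tokens)
        words[group] += len(ref_tokens)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


def wer_gap(wers):
    """Absolute gap between the worst and best group WER (assumed metric)."""
    return max(wers.values()) - min(wers.values())


def co2e_kg(avg_power_watts, seconds, grid_kg_per_kwh):
    """Convert measured power draw and runtime into kg CO2e.

    grid_kg_per_kwh is a region-specific carbon intensity that the caller
    must supply; no default is assumed here.
    """
    energy_kwh = avg_power_watts * seconds / 3.6e6  # joules -> kWh
    return energy_kwh * grid_kg_per_kwh


# Toy usage with made-up transcripts (hypothetical data, not from the paper).
toy = [
    ("group_a", "the cat sat", "the cat sat"),
    ("group_b", "the cat sat", "the bat sat down"),
]
toy_wers = wer_by_group(toy)
```

Aggregating edits and reference words per group before dividing gives a corpus-level WER, which avoids short utterances dominating the average. In practice, power draw would come from a hardware or software energy meter rather than a fixed average.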