🤖 AI Summary
Mainstream English automatic speech recognition (ASR) models—such as Whisper and Seamless-M4T—exhibit substantial disparities in word error rate (WER) across second-language speakers with diverse accents, reflecting poor fairness. Method: This paper proposes a fairness-aware prompt-tuning framework that integrates spectral decoupling (SD), group distributionally robust optimization (Group-DRO), and invariant risk minimization (IRM) into lightweight adapter-based prompt tuning, thereby optimizing for cross-accent performance parity beyond standard empirical risk minimization (ERM). Contribution/Results: Experiments show that our method reduces macro-average WER by 58.7% and 58.5% relative to pretrained Whisper and Seamless-M4T, respectively—significantly outperforming conventional fine-tuning. It substantially narrows inter-accent fairness gaps and establishes a novel paradigm for building fair, robust multi-accent ASR systems.
📝 Abstract
In this work, we address the challenge of building fair English ASR systems for second-language speakers. Our analysis of widely used ASR models, Whisper and Seamless-M4T, reveals large fluctuations in word error rate (WER) across 26 accent groups, indicating significant fairness gaps. To mitigate this, we propose fairness-prompted fine-tuning with lightweight adapters, incorporating Spectral Decoupling (SD), Group Distributionally Robust Optimization (Group-DRO), and Invariant Risk Minimization (IRM). Our proposed fusion of traditional empirical risk minimization (ERM) with cross-entropy and fairness-driven objectives (SD, Group-DRO, and IRM) enhances fairness across accent groups while maintaining overall recognition accuracy. In terms of macro-averaged WER, our approach achieves relative improvements of 58.7% and 58.5% over the large pretrained Whisper and Seamless-M4T, respectively, and of 9.7% and 7.8% over the same models fine-tuned with standard ERM using cross-entropy loss.
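To make the fused objective concrete, here is a minimal, framework-free sketch of how a Group-DRO-style worst-group reweighting and an SD-style logit penalty can be combined with an ERM cross-entropy term. All function names, the exponentiated-gradient weighting, and the penalty coefficients are illustrative assumptions, not the paper's actual implementation (which operates on adapter parameters inside Whisper/Seamless-M4T; the IRM gradient penalty is omitted here because it requires autograd).

```python
import math

def group_dro_weights(group_losses, eta=0.1):
    """Illustrative exponentiated-gradient weighting (assumption, not the
    paper's exact scheme): groups with higher loss receive higher weight,
    so the optimizer focuses on the worst-performing accent groups."""
    raw = [math.exp(eta * loss) for loss in group_losses]
    total = sum(raw)
    return [w / total for w in raw]

def combined_loss(erm_loss, group_losses, logits, lam_sd=0.01, eta=0.1):
    """Sketch of an ERM + Group-DRO + SD fusion.
    erm_loss:     scalar cross-entropy over the whole batch
    group_losses: per-accent-group cross-entropy losses (hypothetical input)
    logits:       model output logits; SD penalizes their squared magnitude
    """
    # Spectral-decoupling-style term: L2 penalty on the logits themselves,
    # discouraging reliance on a few dominant (e.g. accent-correlated) features.
    sd_penalty = lam_sd * sum(z * z for z in logits)
    # Group-DRO term: weighted sum that upweights the worst groups.
    weights = group_dro_weights(group_losses, eta)
    dro_term = sum(w * loss for w, loss in zip(weights, group_losses))
    return erm_loss + dro_term + sd_penalty
```

In a real training loop these terms would be computed per minibatch and backpropagated jointly; the key design choice is that the DRO weights are recomputed each step from the current per-group losses, so no group's error can silently drift upward.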