🤖 AI Summary
Group distributionally robust optimization (group DRO) fails in multilingual automatic speech recognition (ASR) because the connectionist temporal classification (CTC) loss scales with input sequence length and varies with linguistic and acoustic properties. This makes losses incomparable across language groups, producing spurious group disparities and biased group weights. The work identifies this failure mechanism and proposes CTC-DRO, a group DRO framework with two components: (1) smoothed group weight updates that prevent over-penalization of consistently high-loss languages, and (2) input length-matched batching that mitigates CTC's length dependency. Evaluated on five language sets from the ML-SUPERB 2.0 benchmark, CTC-DRO reduces the worst-language error by up to 65.9% and the average error by up to 47.7%, with minimal computational overhead. The approach also offers potential for reducing group disparities in other domains facing similar loss-comparability challenges.
📝 Abstract
Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss scales with input length and varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 65.9% and the average error by up to 47.7%. CTC-DRO can be applied to ASR with minimal computational costs, and offers the potential for reducing group disparities in other domains with similar challenges.
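To make the two components concrete, here is a minimal Python sketch of (1) an exponentiated-gradient group weight update with a damping term and (2) length-matched batch construction. The function names, the `smoothing` hyperparameter, and the particular damped update `exp(eta * L / (smoothing + L))` are illustrative assumptions, not the paper's exact formulation; plain group DRO would use `exp(eta * L)` directly.

```python
import math

def update_group_weights(weights, losses, eta=0.1, smoothing=1.0):
    """Exponentiated-gradient group DRO weight update with damping.

    Plain group DRO multiplies each group's weight by exp(eta * loss),
    which lets a consistently high-loss group dominate. Dividing the
    loss by (smoothing + loss) bounds the exponent, damping the update
    for such groups. NOTE: this damping form is an illustrative
    assumption, not the paper's exact rule.
    """
    scaled = [w * math.exp(eta * L / (smoothing + L))
              for w, L in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]  # renormalize to a distribution

def length_matched_batches(utterances, batch_size):
    """Sort utterances by input length and batch neighbors together,
    so CTC losses are computed over comparable sequence lengths."""
    ordered = sorted(utterances, key=lambda u: u["num_frames"])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```

In training, the weight update would run once per step using each language group's recent average CTC loss, and the sampler would draw each batch from a single length-matched bucket.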