🤖 AI Summary
Group distributionally robust optimization (group DRO) fails in multilingual automatic speech recognition (ASR) because the connectionist temporal classification (CTC) loss scales with input sequence length and varies with linguistic and acoustic properties. This makes losses incomparable across language groups, producing spurious group disparities and biased group weights. The work identifies this failure mechanism and proposes CTC-DRO, a group DRO framework with two components: (1) smoothed group weight updates that prevent over-penalization of consistently high-loss languages, and (2) input length-matched batching that mitigates CTC's length dependency. Evaluated on five language sets from the ML-SUPERB 2.0 benchmark, CTC-DRO reduces the worst-language error by up to 65.9% and the average error by up to 47.7%, with minimal computational overhead. The approach also offers potential for reducing group disparities in other domains facing similar loss-comparability challenges.
📝 Abstract
Modern deep learning models often achieve high overall performance, but consistently fail on specific subgroups. Group distributionally robust optimization (group DRO) addresses this problem by minimizing the worst-group loss, but it fails when group losses misrepresent performance differences between groups. This is common in domains like speech, where the widely used connectionist temporal classification (CTC) loss scales with input length and varies with linguistic and acoustic properties, leading to spurious differences between group losses. We present CTC-DRO, which addresses the shortcomings of the group DRO objective by smoothing the group weight update to prevent overemphasis on consistently high-loss groups, while using input length-matched batching to mitigate CTC's scaling issues. We evaluate CTC-DRO on the task of multilingual automatic speech recognition (ASR) across five language sets from the ML-SUPERB 2.0 benchmark. CTC-DRO consistently outperforms group DRO and CTC-based baseline models, reducing the worst-language error by up to 65.9% and the average error by up to 47.7%. CTC-DRO can be applied to ASR with minimal computational costs, and offers the potential for reducing group disparities in other domains with similar challenges.
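To make the two components concrete, here is a minimal Python sketch of (1) an exponentiated-gradient group weight update with a damping term and (2) length-matched batch construction. The function names, the `smoothing` hyperparameter, and the particular damped update `exp(eta * L / (smoothing + L))` are illustrative assumptions, not the paper's exact formulation; plain group DRO would use `exp(eta * L)` directly.

```python
import math

def update_group_weights(weights, losses, eta=0.1, smoothing=1.0):
    """Exponentiated-gradient group DRO weight update with damping.

    Plain group DRO multiplies each group's weight by exp(eta * loss),
    which lets a consistently high-loss group dominate. Dividing the
    loss by (smoothing + loss) bounds the exponent, damping the update
    for such groups. NOTE: this damping form is an illustrative
    assumption, not the paper's exact rule.
    """
    scaled = [w * math.exp(eta * L / (smoothing + L))
              for w, L in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]  # renormalize to a distribution

def length_matched_batches(utterances, batch_size):
    """Sort utterances by input length and batch neighbors together,
    so CTC losses are computed over comparable sequence lengths."""
    ordered = sorted(utterances, key=lambda u: u["num_frames"])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```

In training, the weight update would run once per step using each language group's recent average CTC loss, and the sampler would draw each batch from a single length-matched bucket.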