The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties

📅 2025-09-08
🤖 AI Summary
Multilingual automatic speech recognition (ASR) development remains uneven: low-resource languages, accents, and dialects are systematically underrepresented, hindering inclusivity and fairness in speech technologies. Method: The Interspeech 2025 ML-SUPERB 2.0 Challenge introduces an evaluation benchmark covering 200+ languages, accents, and dialects, coupled with an online evaluation server built on DynaBench that enables flexible model submission and real-time performance comparison. Submissions were evaluated on both multilingual ASR and language identification (LID). Contribution/Results: The best-performing submission achieves an 18% reduction in character error rate (CER) and a 23% absolute gain in LID accuracy over the best baseline on a general multilingual test set. On accented and dialectal data, it further reduces CER by 30.2% and improves LID accuracy by 15.7%, advancing fairness, robustness, and cross-lingual generalization in multilingual ASR.

📝 Abstract
Recent improvements in multilingual ASR have not been equally distributed across languages and language varieties. To advance state-of-the-art (SOTA) ASR models, we present the Interspeech 2025 ML-SUPERB 2.0 Challenge. We construct a new test suite that consists of data from 200+ languages, accents, and dialects to evaluate SOTA multilingual speech models. The challenge also introduces an online evaluation server based on DynaBench, allowing for flexibility in model design and architecture for participants. The challenge received 5 submissions from 3 teams, all of which outperformed our baselines. The best-performing submission achieved an absolute improvement in LID accuracy of 23% and a reduction in CER of 18% when compared to the best baseline on a general multilingual test set. On accented and dialectal data, the best submission obtained 30.2% lower CER and 15.7% higher LID accuracy, showing the importance of community challenges in making speech technologies more inclusive.
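The headline results are reported in character error rate (CER), the ratio of character-level edit distance between the hypothesis and reference transcripts to the reference length. A minimal sketch of how CER is computed (the function names here are illustrative, not from the challenge toolkit):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences,
    using a single rolling DP row for O(len(hyp)) memory."""
    n = len(hyp)
    dp = list(range(n + 1))  # row for the empty reference prefix
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]  # dp[i-1][j] before overwrite
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
            prev = cur
    return dp[n]


def cer(ref: str, hyp: str) -> float:
    """CER = character edit distance / number of reference characters."""
    return edit_distance(ref, hyp) / len(ref)
```

A "30.2% lower CER" as reported in the abstract is then simply `(baseline_cer - submission_cer) / baseline_cer = 0.302` when read as a relative reduction.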
Problem

Research questions and friction points this paper is trying to address.

Addressing unequal multilingual ASR performance across languages
Evaluating models on 200+ languages, accents, and dialects
Improving speech recognition accuracy for diverse language varieties
Innovation

Methods, ideas, or system contributions that make the work stand out.

New test suite with 200+ languages and dialects
Online evaluation server using DynaBench platform
Best submission achieved 30.2% lower CER on accented and dialectal data