Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current ASR evaluation is heavily biased toward short English utterances and rarely reports reproducibility, transparency, or inference efficiency for multilingual and long-form speech. To address this, the authors introduce the Open ASR Leaderboard, an open-source, fully reproducible end-to-end ASR benchmark covering 11 diverse datasets with dedicated multilingual and long-form tracks. The benchmark unifies text normalization, jointly reports word error rate (WER) and inverse real-time factor (RTFx), and standardizes evaluation protocols. It spans a range of architectures, including Conformer encoders, CTC and TDT decoders, LLM-based decoders, and Whisper-derived encoders fine-tuned for English. All code, data preprocessing tools, and evaluation scripts are publicly released. Comprehensive evaluation of over 60 open- and closed-source systems shows that Conformer encoders paired with LLM decoders achieve the best accuracy, while CTC and TDT decoders offer superior latency-efficiency trade-offs. This work establishes a transparent, end-to-end public benchmark for multilingual and long-form ASR, enabling principled accuracy-efficiency analysis.

📝 Abstract
Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including dedicated multilingual and long-form tracks. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.
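The abstract's two headline metrics, WER and RTFx, can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the leaderboard's actual implementation; the released evaluation scripts are the authoritative reference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as Levenshtein edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration / decoding time.
    RTFx > 1 means faster than real time; higher is better."""
    return audio_seconds / processing_seconds
```

Reporting both metrics side by side is what enables the accuracy-efficiency comparison the paper describes: a system with low WER but low RTFx may be unattractive for long-form or offline transcription.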
Problem

Research questions and friction points this paper is trying to address.

Standardizing multilingual and long-form speech recognition evaluation metrics
Comparing accuracy-efficiency tradeoffs across 60+ ASR systems
Providing reproducible benchmarks for transparent ASR performance assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized text normalization for fair comparisons
Combined WER and RTFx metrics for accuracy-efficiency evaluation
Open-sourced code and dataset loaders for reproducibility
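The standardized text normalization mentioned above typically means applying the same cleanup to references and hypotheses before scoring, so that formatting differences (casing, punctuation) do not inflate WER. A hypothetical sketch of that kind of normalizer; the leaderboard's actual normalizer lives in its released code and may differ:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Illustrative normalization: Unicode normalization, lowercasing,
    punctuation stripping (keeping apostrophes), whitespace collapsing."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation except apostrophes
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()
```

Applying one shared normalizer to every system's output is what makes cross-system WER numbers comparable in the first place.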