🤖 AI Summary
This study addresses robust zero-shot automatic speech recognition (ASR) with a data-quality-driven, fully open modeling paradigm. To handle the high noise levels and weak audio-text pairing in large-scale raw speech data, the authors design text-based heuristic filters and a multi-stage cleaning pipeline, curating OLMoASR-Mix—a 1M-hour high-quality paired audio-text dataset—from 3M hours of raw audio, while also releasing OLMoASR-Pool, the underlying open data reservoir. On this dataset they train a family of end-to-end ASR models spanning 39M to 1.5B parameters. OLMoASR-medium.en achieves word error rates of 12.8% on short-form and 11.0% on long-form speech tasks—on par with Whisper-medium.en—demonstrating that combining high-quality data curation with scalable architectures substantially improves zero-shot robustness. This work fills a critical gap in the open-source community by delivering high-performance, publicly available ASR models, data, and code.
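The headline numbers above are word error rates (WER): the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length. A minimal self-contained sketch of the metric (this is the standard definition, not code from the paper's release):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j] (dynamic programming).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev = distance(ref[:i-1], hyp[:j-1])
        for j, h in enumerate(hyp, 1):
            cur = min(
                d[j] + 1,         # deletion of reference word r
                d[j - 1] + 1,     # insertion of hypothesis word h
                prev + (r != h),  # substitution (or free match)
            )
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the bat sat")` is one substitution over three reference words, i.e. about 0.33 (33.3% WER); reported WERs like 12.8% are this ratio averaged over a benchmark.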
📝 Abstract
Improvements in training data scale and quality have led to significant advances, yet their influence in speech recognition remains underexplored. In this paper, we present a large-scale dataset, OLMoASR-Pool, and a series of models, OLMoASR, to study and develop robust zero-shot speech recognition models. Beginning from OLMoASR-Pool, a collection of 3M hours of English audio and 17M transcripts, we design text heuristic filters to remove low-quality or mistranscribed data. Our curation pipeline produces a new dataset containing 1M hours of high-quality audio-transcript pairs, which we call OLMoASR-Mix. We use OLMoASR-Mix to train the OLMoASR suite of models, ranging from 39M (tiny.en) to 1.5B (large.en) parameters. Across all model scales, OLMoASR achieves average performance comparable to OpenAI's Whisper on short- and long-form speech recognition benchmarks. Notably, OLMoASR-medium.en attains 12.8% and 11.0% word error rates (WER) on short- and long-form recognition respectively, on par with the 12.4% and 10.5% WER of Whisper's largest English-only model, Whisper-medium.en, at equivalent parameter count. OLMoASR-Pool, the OLMoASR models, and the filtering, training, and evaluation code will be made publicly available to further research on robust speech processing.
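The abstract's key curation step is text-only heuristic filtering of candidate transcripts before pairing them with audio. As a hedged illustration of what such filters can look like, the sketch below checks for all-caps text, missing punctuation, and heavy word repetition; the specific rules and thresholds here are assumptions for illustration, not the paper's actual filter set:

```python
import re

def passes_text_filters(transcript: str) -> bool:
    """Illustrative transcript-quality heuristics.

    NOTE: these rules and thresholds are hypothetical examples of
    text-based filtering, not the filters used to build OLMoASR-Mix.
    """
    words = transcript.split()
    if not words:
        return False
    # All-caps text often indicates machine-generated or low-quality captions.
    if transcript.isupper():
        return False
    # Transcripts with no punctuation at all are another common ASR-output tell.
    if not re.search(r"[.,!?]", transcript):
        return False
    # Highly repetitive text suggests a stuck decoder or boilerplate captions.
    if len(set(words)) / len(words) < 0.3:
        return False
    return True
```

Filters of this kind are cheap to run at the 17M-transcript scale mentioned above because they never touch the audio, which is what makes text-informed curation practical before any paired training.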