🤖 AI Summary
This study addresses robust zero-shot automatic speech recognition (ASR) with a data-quality-driven, fully open modeling paradigm. To handle the high noise levels and weak audio-text pairing in large-scale raw speech data, the authors design text-based heuristic filters and a multi-stage cleaning pipeline, curating OLMoASR-Mix—a 1M-hour high-quality paired audio-text dataset—from 3M hours of raw audio, while also releasing OLMoASR-Pool, the underlying open data reservoir. On this dataset they train a family of end-to-end ASR models spanning 39M to 1.5B parameters. OLMoASR-medium.en achieves word error rates of 12.8% on short-form and 11.0% on long-form speech tasks—on par with Whisper-medium.en—demonstrating that combining high-quality data curation with scalable architectures substantially improves zero-shot robustness. This work fills a critical gap in the open-source community by delivering high-performance, publicly available ASR models, data, and code.
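The headline numbers above are word error rates (WER): the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length. A minimal self-contained sketch of the metric (this is the standard definition, not code from the paper's release):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j] (dynamic programming).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev = distance(ref[:i-1], hyp[:j-1])
        for j, h in enumerate(hyp, 1):
            cur = min(
                d[j] + 1,         # deletion of reference word r
                d[j - 1] + 1,     # insertion of hypothesis word h
                prev + (r != h),  # substitution (or free match)
            )
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat", "the bat sat")` is one substitution over three reference words, i.e. about 0.33 (33.3% WER); reported WERs like 12.8% are this ratio averaged over a benchmark.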
📝 Abstract
Improvements in training data scale and quality have led to significant advances, yet their influence in speech recognition remains underexplored. In this paper, we present a large-scale dataset, OLMoASR-Pool, and a series of models, OLMoASR, to study and develop robust zero-shot speech recognition models. Beginning from OLMoASR-Pool, a collection of 3M hours of English audio and 17M transcripts, we design text heuristic filters to remove low-quality or mistranscribed data. Our curation pipeline produces a new dataset containing 1M hours of high-quality audio-transcript pairs, which we call OLMoASR-Mix. We use OLMoASR-Mix to train the OLMoASR suite of models, ranging from 39M (tiny.en) to 1.5B (large.en) parameters. Across all model scales, OLMoASR achieves average performance comparable to OpenAI's Whisper on short- and long-form speech recognition benchmarks. Notably, OLMoASR-medium.en attains 12.8% and 11.0% word error rates (WER) on short- and long-form recognition respectively, on par with the 12.4% and 10.5% WER of Whisper's largest English-only model, Whisper-medium.en, at equivalent parameter count. OLMoASR-Pool, the OLMoASR models, and the filtering, training, and evaluation code will be made publicly available to further research on robust speech processing.
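The abstract's key curation step is text-only heuristic filtering of candidate transcripts before pairing them with audio. As a hedged illustration of what such filters can look like, the sketch below checks for all-caps text, missing punctuation, and heavy word repetition; the specific rules and thresholds here are assumptions for illustration, not the paper's actual filter set:

```python
import re

def passes_text_filters(transcript: str) -> bool:
    """Illustrative transcript-quality heuristics.

    NOTE: these rules and thresholds are hypothetical examples of
    text-based filtering, not the filters used to build OLMoASR-Mix.
    """
    words = transcript.split()
    if not words:
        return False
    # All-caps text often indicates machine-generated or low-quality captions.
    if transcript.isupper():
        return False
    # Transcripts with no punctuation at all are another common ASR-output tell.
    if not re.search(r"[.,!?]", transcript):
        return False
    # Highly repetitive text suggests a stuck decoder or boilerplate captions.
    if len(set(words)) / len(words) < 0.3:
        return False
    return True
```

Filters of this kind are cheap to run at the 17M-transcript scale mentioned above because they never touch the audio, which is what makes text-informed curation practical before any paired training.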