Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use

📅 2025-05-27
🤖 AI Summary
Current ASR research is hindered by the lack of high-quality English speech datasets that satisfy both academic openness and industrial usability: existing resources suffer from insufficient scale (e.g., LibriSpeech) or from restrictive licensing, transcription errors, audio distortions, and the absence of standardized evaluation sets (e.g., MOSEL, YODAS, Gigaspeech). To address this, we introduce the first 25,000-hour, commercially licensable, highly diverse English ASR dataset, covering extensive speaker variability, multiple accents, diverse speaking styles (read, spontaneous, lecture), and acoustic conditions (clean, noisy). We propose a novel multi-source heterogeneous speech fusion strategy, a quantitative metric for phonetic diversity assessment, and a standardized cleaning-and-annotation pipeline. The dataset comprises a clean subset, a noise-augmented subset, and a unified, rigorously curated evaluation set. Empirical results demonstrate substantial improvements in ASR model robustness and generalization under realistic, challenging acoustic and linguistic conditions.

📝 Abstract
Automatic speech recognition (ASR) research is driven by the availability of datasets shared between industrial researchers and academics, which encourages comparisons and evaluations. LibriSpeech, despite its long success as an ASR benchmark, is now limited by its size and its focus on clean, read speech, leading to near-zero word error rates. More recent datasets, including MOSEL, YODAS, Gigaspeech, OWSM, Libriheavy, and People's Speech, suffer from major limitations, including licenses that industry researchers cannot use, unreliable transcriptions, incorrect audio data, or the lack of evaluation sets. This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. Featuring hundreds of thousands of speakers with diverse accents and a wide range of speech types (read, spontaneous, talks, clean, noisy), the Loquacious Set is designed to let academics and industry researchers build ASR systems for real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale diverse English speech datasets for ASR research
Existing datasets have restrictive licenses or unreliable transcriptions
Need for commercially usable ASR data with varied accents and speech types
Innovation

Methods, ideas, or system contributions that make the work stand out.

25,000 hours of commercially usable English speech
Diverse accents and a wide range of speech types
Designed for real-world ASR systems