🤖 AI Summary
Existing multilingual speech datasets severely undercover low-resource languages, which limits how well automatic speech recognition (ASR) models generalize. To address this, we introduce Europarl-ST, the largest open-source multilingual speech corpus to date, comprising over 61,000 hours of high-quality, speaker-verified speech-text alignments extracted from parliamentary proceedings in 22 European languages. We propose a scalable two-stage pipeline that combines media retrieval with robust speech-text alignment, handling non-verbatim transcripts, long-form audio, and cross-lingual temporal misalignments. The corpus covers all 22 languages, 19 of which exceed 1,000 hours of annotated data. Fine-tuning state-of-the-art ASR models on Europarl-ST yields an average 41.8% relative reduction in word error rate over the baselines, with the largest gains on low-resource languages.
📝 Abstract
Recent progress in speech processing has shown that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, most of those languages are represented by too little data, so trained models perform poorly on the majority of the languages they support. We address this challenge with a scalable pipeline for constructing speech datasets from parliamentary recordings. The pipeline combines robust media retrieval with a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying it to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, with substantial per-language coverage: 19 languages exceed 1k hours and all 22 exceed 500 hours of high-quality speech data. Fine-tuning an existing ASR model on our dataset reduces word error rates by 41.8% on average relative to the baselines, demonstrating the usefulness of our approach.
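The headline metric, relative word error rate (WER) reduction, can be made concrete with a short sketch. This is a minimal illustration, not code from the paper: the transcripts and the baseline/fine-tuned WER values below are hypothetical, chosen only so that the relative reduction lands near the reported 41.8% average.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)


def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction of the baseline's errors removed by fine-tuning."""
    return (baseline_wer - finetuned_wer) / baseline_wer


# Hypothetical example: one substituted word out of three -> WER = 1/3.
print(wer("the cat sat", "the cat sit"))

# Hypothetical numbers: a baseline WER of 0.24 dropping to 0.14 is a
# ~41.7% relative reduction, close to the paper's reported average.
print(relative_wer_reduction(0.24, 0.14))
```

Note that a 41.8% *relative* reduction is a much stronger result than a 41.8-point absolute drop would imply: it means fine-tuning removes roughly two fifths of the baseline model's errors, whatever the baseline's starting WER.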