🤖 AI Summary
Existing multilingual speech datasets severely undercover low-resource languages, which limits how well automatic speech recognition (ASR) models generalize. To address this, we introduce Europarl-ST, the largest open-source multilingual speech corpus to date, comprising over 61,000 hours of high-quality, speaker-verified speech-text alignments extracted from parliamentary proceedings in 22 European languages. We propose a scalable two-stage pipeline that combines media retrieval with robust speech-text alignment, handling non-verbatim transcripts, long-form audio, and cross-lingual temporal misalignments. The corpus covers all 22 languages, 19 of which exceed 1,000 hours of annotated data. Fine-tuning state-of-the-art ASR models on Europarl-ST yields an average 41.8% relative reduction in word error rate over the baselines, with the largest gains on low-resource languages.
📝 Abstract
Recent progress in speech processing has shown that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, most of those languages are represented by too little data, so trained models perform poorly on the majority of the languages they support. We address this challenge with a scalable pipeline for constructing speech datasets from parliamentary recordings. The pipeline combines robust media retrieval with a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying it to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, with substantial per-language coverage: 19 languages exceed 1k hours and all 22 exceed 500 hours of high-quality speech data. Fine-tuning an existing ASR model on our dataset reduces word error rates by 41.8% on average relative to the baselines, demonstrating the usefulness of our approach.
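The headline metric, relative word error rate (WER) reduction, can be made concrete with a short sketch. This is a minimal illustration, not code from the paper: the transcripts and the baseline/fine-tuned WER values below are hypothetical, chosen only so that the relative reduction lands near the reported 41.8% average.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)


def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction of the baseline's errors removed by fine-tuning."""
    return (baseline_wer - finetuned_wer) / baseline_wer


# Hypothetical example: one substituted word out of three -> WER = 1/3.
print(wer("the cat sat", "the cat sit"))

# Hypothetical numbers: a baseline WER of 0.24 dropping to 0.14 is a
# ~41.7% relative reduction, close to the paper's reported average.
print(relative_wer_reduction(0.24, 0.14))
```

Note that a 41.8% *relative* reduction is a much stronger result than a 41.8-point absolute drop would imply: it means fine-tuning removes roughly two fifths of the baseline model's errors, whatever the baseline's starting WER.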