SloPalSpeech: A 2,800-Hour Slovak Speech Corpus from Parliamentary Data

📅 2025-09-23

🤖 AI Summary

Automatic speech recognition (ASR) for Slovak, a low-resource language, performs poorly because training data is scarce. To address this, we introduce SloPalSpeech, the first large-scale parliamentary speech dataset for Slovak, comprising 2,806 hours of audio. We design an automated pipeline integrating forced alignment, precise utterance segmentation, and noise filtering to produce high-quality speech-text pairs. This work marks the first systematic use of parliamentary corpora for Slovak ASR data construction. Fine-tuning Whisper-small on SloPalSpeech yields up to a 70% relative reduction in word error rate (WER) on the Common Voice and FLEURS benchmarks, approaching the baseline performance of the much larger Whisper-large-v3. We publicly release the complete dataset, a 60-million-word text corpus, and all fine-tuned models, providing a reproducible benchmark and a practical resource for low-resource ASR research.

📝 Abstract

Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. For instance, the fine-tuned Whisper-small model's WER dropped by up to 70%, approaching the baseline performance of the much larger Whisper-large-v3 model. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset, the fully segmented transcripts (60 million words), and all our fine-tuned models.
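The WER figures above are relative reductions, which are easy to confuse with absolute percentage-point drops. The following is a minimal sketch of how WER and a relative reduction are commonly computed, assuming whitespace tokenization and no text normalization. The function names and the example numbers are hypothetical and are not taken from the paper's evaluation code or results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction of the baseline WER removed by fine-tuning."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - finetuned_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical numbers for illustration only (not results from the paper):
    # a drop from 40% WER to 12% WER is a 70% relative reduction.
    print(f"{relative_reduction(0.40, 0.12):.0%}")
```

Under this definition, a 70% relative reduction means the fine-tuned model makes roughly 30% as many word errors as the baseline on the same test set, whatever the baseline's absolute WER was.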

Problem

Research questions and friction points this paper is trying to address.

Addressing Slovak ASR data scarcity with a 2,800-hour parliamentary corpus

Creating a robust pipeline to process long recordings for model training

Fine-tuning Whisper models to significantly reduce Slovak Word Error Rate
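To make the segmentation step concrete, here is a minimal sketch of one way to pack word-level timestamps into segments of at most 30 seconds. It assumes a forced aligner has already produced a start and end time for each word. The `Word` and `Segment` types and the `pack_segments` function are hypothetical names for illustration, and the greedy strategy shown is an assumption, not a description of the authors' actual pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


@dataclass
class Segment:
    start: float
    end: float
    text: str


def _close(words: List[Word]) -> Segment:
    """Merge consecutive aligned words into one audio-transcript pair."""
    return Segment(
        start=words[0].start,
        end=words[-1].end,
        text=" ".join(w.text for w in words),
    )


def pack_segments(words: List[Word], max_seconds: float = 30.0) -> List[Segment]:
    """Greedily group words so that each segment spans at most max_seconds.

    A single word longer than max_seconds still becomes its own segment;
    a real pipeline would likely filter such cases out.
    """
    segments: List[Segment] = []
    current: List[Word] = []
    for word in words:
        # Start a new segment if adding this word would exceed the limit.
        if current and word.end - current[0].start > max_seconds:
            segments.append(_close(current))
            current = []
        current.append(word)
    if current:
        segments.append(_close(current))
    return segments


if __name__ == "__main__":
    aligned = [
        Word("a", 0.0, 10.0),
        Word("b", 10.0, 20.0),
        Word("c", 20.0, 29.0),
        Word("d", 29.0, 35.0),
        Word("e", 35.0, 40.0),
    ]
    for seg in pack_segments(aligned):
        print(seg)
```

The 30-second default matches the segment length mentioned in the abstract. A production pipeline would probably also prefer to cut at pauses or sentence boundaries and discard poorly aligned spans, which this sketch does not attempt.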

Innovation

Methods, ideas, or system contributions that make the work stand out.

Created a large Slovak speech corpus (2,806 hours) from parliamentary proceedings

Developed pipeline for aligning and segmenting recordings

Fine-tuned Whisper models achieving major WER reductions
