🤖 AI Summary
Automatic speech recognition (ASR) for Slovak, a low-resource language, is held back by insufficient training data. To address this, we introduce SloPalSpeech, the first large-scale parliamentary speech dataset for Slovak, comprising 2,806 hours of audio. We design an automated pipeline integrating forced alignment, precise utterance segmentation, and noise filtering to produce high-quality speech-text pairs. This work marks the first systematic use of parliamentary corpora for Slovak ASR data construction. Fine-tuning Whisper-small on SloPalSpeech yields up to a 70% relative reduction in word error rate (WER) on the Common Voice and FLEURS benchmarks, approaching the baseline performance of the much larger Whisper-large-v3. We publicly release the complete dataset, the segmented transcripts (60 million words), and all fine-tuned models, providing a reproducible benchmark and practical resource for low-resource ASR research.
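To make the "relative reduction in WER" figure concrete, here is a minimal sketch of how WER and a relative reduction are commonly computed. It assumes the standard definition (word-level edit distance divided by the number of reference words) and simple whitespace tokenization; the paper's exact text normalization is not described here, and the function names and the numbers in the usage lines are illustrative assumptions, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction of the baseline WER removed by fine-tuning."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - finetuned_wer) / baseline_wer


# Illustrative values only (hypothetical, not taken from the paper).
wer = word_error_rate("dobrý deň vážení poslanci", "dobrý den vážení poslanci")
print(f"example WER: {wer:.2f}")
print(f"relative reduction: {relative_reduction(0.40, 0.12):.0%}")
```

Under this definition, a drop from a WER of 0.40 to 0.12 is a 70% relative reduction, even though the absolute change is 28 percentage points.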
📝 Abstract
Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. For instance, the fine-tuned Whisper-small model's WER dropped by up to 70%, approaching the baseline performance of the much larger Whisper-large-v3 model. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset, the fully segmented transcripts (60 million words), and all our fine-tuned models.
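The segmentation step can be pictured with a minimal sketch. It assumes that forced alignment has already produced word-level timestamps, and it greedily packs consecutive words into segments no longer than 30 seconds. The `Word` and `Segment` types and the `pack_segments` function are hypothetical names introduced for illustration; the authors' actual pipeline, including how it filters noisy or misaligned spans, may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


@dataclass
class Segment:
    text: str
    start: float
    end: float


def _close(words: List[Word]) -> Segment:
    """Join a run of aligned words into one audio-transcript segment."""
    return Segment(
        text=" ".join(w.text for w in words),
        start=words[0].start,
        end=words[-1].end,
    )


def pack_segments(words: List[Word], max_len: float = 30.0) -> List[Segment]:
    """Greedily group consecutive words so each segment spans at most max_len seconds.

    A single word longer than max_len still becomes its own segment;
    a real pipeline would likely drop or flag such cases.
    """
    segments: List[Segment] = []
    current: List[Word] = []
    for word in words:
        # Start a new segment if adding this word would exceed the limit.
        if current and word.end - current[0].start > max_len:
            segments.append(_close(current))
            current = []
        current.append(word)
    if current:
        segments.append(_close(current))
    return segments


# Toy alignment with made-up timestamps (hypothetical data).
aligned = [
    Word("a", 0.0, 10.0),
    Word("b", 10.0, 20.0),
    Word("c", 20.0, 29.0),
    Word("d", 29.0, 35.0),
    Word("e", 35.0, 40.0),
]
for seg in pack_segments(aligned):
    print(f"{seg.start:6.1f}-{seg.end:6.1f}  {seg.text}")
```

The 30-second limit matches the fixed input window that Whisper models process, which is presumably why the dataset targets segments of that length.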