🤖 AI Summary
To address data scarcity in Czech speech modeling, this work introduces ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus whose largest variant contains 2,695 hours of time-aligned speech and text, built from recordings of Czech parliamentary speeches and their official transcripts. The data comes in three variants: sentence-segmented (clean boundaries for ASR and TTS), unsegmented (preserving the original utterance flow across sentences), and raw-alignment (for further custom refinement). Alignments were produced automatically with WhisperX and Wav2Vec 2.0, and the pipeline extracts more data with higher alignment reliability than the ParCzech 3.0 speech recognition version. All variants keep the original metadata and are released under a permissive CC-BY license in the LINDAT repository; the sentence-segmented and unsegmented variants are also available on Hugging Face.
📝 Abstract
We introduce ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus targeted at speech modeling tasks; its largest variant contains 2,695 hours of speech. We combined the audio recordings of Czech parliamentary speeches with their official transcripts and processed the recordings with WhisperX and Wav2Vec 2.0 to obtain automatic audio-text alignment. Our processing pipeline improves upon the ParCzech 3.0 speech recognition version by extracting more data with higher alignment reliability. The dataset is offered in three variants: (1) sentence-segmented, with clean boundaries for automatic speech recognition and speech synthesis tasks; (2) unsegmented, preserving the original utterance flow across sentences; and (3) raw-alignment, for further custom refinement for other tasks. All variants retain the original metadata and are released under a permissive CC-BY license. The dataset is available in the LINDAT repository, and the sentence-segmented and unsegmented variants are additionally available on Hugging Face.
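To make the relationship between the raw-alignment and sentence-segmented variants concrete, here is a minimal sketch of how word-level alignments could be grouped into sentence-level segments. It is not the authors' pipeline: the `AlignedWord` record, its field names, the punctuation-based splitting rule, and the sample timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """Hypothetical word-level alignment record (not the dataset's schema)."""
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def close_segment(words):
    """Collapse a run of aligned words into one segment with its time span."""
    return {
        "text": " ".join(w.text for w in words),
        "start": words[0].start,
        "end": words[-1].end,
    }


def segment_sentences(words, terminators=(".", "?", "!")):
    """Group word alignments into sentences, splitting after terminal punctuation."""
    segments, current = [], []
    for word in words:
        current.append(word)
        if word.text.endswith(terminators):
            segments.append(close_segment(current))
            current = []
    if current:  # trailing words without terminal punctuation
        segments.append(close_segment(current))
    return segments


# Made-up sample alignment, for illustration only.
words = [
    AlignedWord("Vážené", 0.0, 0.4),
    AlignedWord("kolegyně.", 0.4, 1.0),
    AlignedWord("Děkuji", 1.3, 1.8),
    AlignedWord("za", 1.8, 1.9),
    AlignedWord("slovo.", 1.9, 2.4),
]
segments = segment_sentences(words)
for seg in segments:
    print(seg)
```

Under these assumptions, each output segment carries its own start and end time, which is the kind of clean boundary that ASR and TTS training typically needs, while the word-level input stays available for other kinds of refinement.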