ParCzech4Speech: A New Speech Corpus Derived from Czech Parliamentary Data

📅 2025-09-08

🤖 AI Summary

To address data scarcity in Czech speech modeling, this work introduces ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus of Czech parliamentary speech. Its largest variant comprises 2,695 hours of time-aligned speech and text, built from parliamentary audio recordings and the official transcripts. The data is offered in three variants (sentence-segmented, unsegmented, and raw-alignment) to support downstream tasks such as ASR and TTS. Audio-text alignment was extracted automatically with WhisperX and Wav2Vec 2.0; compared with the ParCzech 3.0 speech recognition version, the pipeline extracts more data with higher alignment reliability. All variants preserve the original metadata and are released under a CC-BY license in the LINDAT repository, with the sentence-segmented and unsegmented variants also available on Hugging Face.

📝 Abstract

We introduce ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus, targeted at speech modeling tasks with the largest variant containing 2,695 hours. We combined the sound recordings of the Czech parliamentary speeches with the official transcripts. The recordings were processed with WhisperX and Wav2Vec 2.0 to extract automated audio-text alignment. Our processing pipeline improves upon the ParCzech 3.0 speech recognition version by extracting more data with higher alignment reliability. The dataset is offered in three flexible variants: (1) sentence-segmented for automatic speech recognition and speech synthesis tasks with clean boundaries, (2) unsegmented preserving original utterance flow across sentences, and (3) a raw-alignment for further custom refinement for other possible tasks. All variants maintain the original metadata and are released under a permissive CC-BY license. The dataset is available in the LINDAT repository, with the sentence-segmented and unsegmented variants additionally available on Hugging Face.
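The relationship between the three variants can be illustrated with a minimal sketch. It assumes the raw alignment is a list of words with start and end times, and it derives the two segmented views from that list. The names `AlignedWord`, `Segment`, `sentence_segmented`, `unsegmented`, and the `max_gap` pause threshold are all hypothetical; they are not taken from the ParCzech4Speech pipeline, whose actual segmentation rules are not described in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedWord:
    text: str     # token as it appears in the transcript
    start: float  # start time in seconds
    end: float    # end time in seconds


@dataclass
class Segment:
    text: str
    start: float
    end: float


def _merge(words: List[AlignedWord]) -> Segment:
    # Join consecutive words into one segment spanning first start to last end.
    return Segment(" ".join(w.text for w in words), words[0].start, words[-1].end)


def sentence_segmented(words: List[AlignedWord]) -> List[Segment]:
    # Assumption: a sentence ends at a token with final ., ? or ! punctuation.
    segments, current = [], []
    for word in words:
        current.append(word)
        if word.text.endswith((".", "?", "!")):
            segments.append(_merge(current))
            current = []
    if current:
        segments.append(_merge(current))
    return segments


def unsegmented(words: List[AlignedWord], max_gap: float = 1.0) -> List[Segment]:
    # Assumption: keep the speech flow across sentences and split only
    # where the pause between two words exceeds max_gap seconds.
    segments, current = [], []
    for word in words:
        if current and word.start - current[-1].end > max_gap:
            segments.append(_merge(current))
            current = []
        current.append(word)
    if current:
        segments.append(_merge(current))
    return segments


# Toy raw alignment (invented timestamps, for illustration only).
raw_alignment = [
    AlignedWord("Dobrý", 0.0, 0.4),
    AlignedWord("den.", 0.4, 0.8),
    AlignedWord("Děkuji", 0.9, 1.4),
    AlignedWord("vám.", 1.4, 1.8),
    AlignedWord("Pokračujeme.", 5.0, 6.0),
]

for segment in sentence_segmented(raw_alignment):
    print(segment)
```

In this toy setup the sentence-segmented view yields one clip per sentence, which suits ASR and TTS training, while the unsegmented view keeps adjacent sentences together until a long pause occurs.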

Problem

Research questions and friction points this paper is trying to address.

Creating a large Czech speech corpus for modeling tasks

Improving audio-text alignment reliability in parliamentary data

Providing flexible dataset variants for speech processing applications

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combined parliamentary recordings with official transcripts

Used WhisperX and Wav2Vec 2.0 for audio-text alignment

Created three dataset variants for different speech tasks
