ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

📅 2025-11-03
🤖 AI Summary
To address the scarcity of spoken parliamentary corpora for Slavic languages and their limited annotation, this study presents a large-scale, multilingual speech corpus comprising 6,000 hours of spoken Croatian, Czech, Polish, and Serbian. The corpora are built automatically by aligning ParlaMint transcripts and their metadata with the speech recordings of each parliament. The textual modality of all four corpora is enriched with linguistic annotations, such as part-of-speech tags and dependency parses, and with sentiment predictions; the spoken modality is enriched with automatically detected filled pauses. Two of the four languages additionally receive word- and grapheme-level alignments and automatic annotation of the position of primary stress in multisyllabic words. All corpora are available for download in JSONL and TextGrid formats and can be searched through a concordancer. The authors showcase the usefulness of the enriched corpora through an analysis of acoustic correlates of sentiment, positioning the resource for research in linguistics, computational social science, and speech technology.

📝 Abstract
ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages - Croatian, Czech, Polish and Serbian - together totaling 6 thousand hours. The corpora were built automatically from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each corresponding parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similarly, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two of the four languages have been additionally enriched with detailed word- and grapheme-level alignments, and with automatic annotation of the position of primary stress in multisyllabic words. These enrichments make the underlying corpora considerably more useful for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are made available for download in JSONL and TextGrid formats, as well as for search through a concordancer.
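
Since the corpora are distributed as JSONL (one JSON record per line), a downstream analysis could start from a loop like the sketch below. This is only an illustration under stated assumptions: the field names `start`, `end`, and `sentiment`, the label values, and the sample records are hypothetical and do not reflect the actual ParlaSpeech schema, which should be checked against the released files.

```python
import io
import json
from collections import defaultdict


def mean_duration_by_label(lines, label_key="sentiment",
                           start_key="start", end_key="end"):
    """Average segment duration in seconds per label, from JSONL lines.

    All key names are hypothetical placeholders, not the real schema.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        totals[record[label_key]] += record[end_key] - record[start_key]
        counts[record[label_key]] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Made-up records standing in for a corpus file; with real data one would
# pass an open file handle instead of this in-memory buffer.
sample = io.StringIO(
    '{"id": "a", "start": 0.0, "end": 2.0, "sentiment": "negative"}\n'
    '{"id": "b", "start": 2.0, "end": 3.0, "sentiment": "positive"}\n'
    '{"id": "c", "start": 3.0, "end": 7.0, "sentiment": "negative"}\n'
)
result = mean_duration_by_label(sample)
print(result)
```

Reading line by line keeps memory use flat, which matters for corpora of this size; the same function accepts any iterable of strings, including an open file.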
Problem

Research questions and friction points this paper is trying to address.

Creating richly annotated parliamentary speech corpora for Slavic languages
Enriching transcripts with linguistic annotations and sentiment predictions
Automatically detecting disfluencies and stress patterns in spoken data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically aligned parliamentary transcripts with speech recordings
Enriched corpora with linguistic annotations and sentiment predictions
Added disfluency detection and word-level alignment layers
Nikola Ljubešić
Researcher at Jožef Stefan Institute
natural language processing, computational linguistics, computational social science
Peter Rupnik
Jožef Stefan Institute, Ljubljana, Slovenia
Ivan Porupski
Jožef Stefan Institute, Ljubljana, Slovenia
Taja Kuzman Pungeršek
Jožef Stefan Institute, Ljubljana, Slovenia