Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification

📅 2026-02-18
🤖 AI Summary
This study addresses the scarcity of large-scale, high-quality annotated data on policy topics in multilingual parliamentary speeches, which has hindered comparative research on political agendas. The authors introduce ParlaCAP, a dataset built on the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions. They follow a teacher-student framework: a large language model (LLM) annotates in-domain training data with policy labels from the Comparative Agendas Project (CAP) coding scheme, and a multilingual Transformer encoder is fine-tuned on these annotations to label the full corpus at scale. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting classifier outperforms existing CAP classifiers trained on manually annotated but out-of-domain data. The dataset also provides speaker and party metadata and sentiment predictions from the ParlaSent model, enabling cross-national analyses of policy attention, sentiment patterns, and gender differences in legislative discourse.
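The comparison between LLM-human agreement and human-human agreement can be illustrated with a chance-corrected agreement score. The sketch below uses Cohen's kappa as one common choice; the summary does not say which agreement measure the authors used, so the metric, the function name `cohen_kappa`, and the toy label lists are all assumptions made up for illustration, not data or results from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Share of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up toy annotations (hypothetical; not taken from ParlaCAP).
human_1 = ["Health", "Health", "Education", "Defense", "Education", "Health"]
human_2 = ["Health", "Education", "Education", "Defense", "Education", "Health"]
llm = ["Health", "Health", "Education", "Defense", "Health", "Health"]

kappa_humans = cohen_kappa(human_1, human_2)
kappa_llm = cohen_kappa(llm, human_1)
print(f"human vs. human: {kappa_humans:.3f}")
print(f"LLM vs. human:   {kappa_llm:.3f}")
```

If the two scores are close, the LLM disagrees with a human about as often as two humans disagree with each other, which is the sense of "comparable" used above.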

📝 Abstract
This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.
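The teacher-student idea in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: the teacher here is a hypothetical keyword rule standing in for the LLM, the student is a small naive Bayes classifier standing in for the fine-tuned multilingual encoder, and the speeches and the two topic labels are invented for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for the LLM teacher: simple keyword rules.
TEACHER_KEYWORDS = {
    "Health": {"hospital", "doctors", "vaccine"},
    "Education": {"schools", "teachers", "students"},
}


def teacher_label(speech):
    """Return the topic whose keywords overlap most with the speech."""
    tokens = set(speech.lower().split())
    scores = {topic: len(tokens & words) for topic, words in TEACHER_KEYWORDS.items()}
    return max(scores, key=scores.get)


class NaiveBayesStudent:
    """Toy student model trained only on teacher-produced labels."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for word in text.lower().split():
                # Add-one smoothing so unseen words do not zero out a class.
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Step 1: the teacher annotates unlabeled in-domain speeches.
unlabeled = [
    "the hospital needs more doctors",
    "vaccine funding for every hospital",
    "schools need more teachers",
    "students deserve better schools",
]
silver_labels = [teacher_label(s) for s in unlabeled]

# Step 2: the student is trained on the teacher's labels.
student = NaiveBayesStudent().fit(unlabeled, silver_labels)

# Step 3: the cheap student labels new speeches at scale.
prediction = student.predict("nurses and doctors in the hospital")
print(silver_labels, prediction)
```

The design point is cost: the expensive teacher is called only on a training sample, while the cheaper student labels the rest of the corpus.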
Problem

Research questions and friction points this paper is trying to address.

agenda setting
parliamentary speeches
policy classification
multilingual dataset
Comparative Agendas Project
Innovation

Methods, ideas, or system contributions that make the work stand out.

teacher-student framework
multilingual LLM-based classification
domain-specific policy topic classifier
ParlaCAP dataset
scalable annotation
Taja Kuzman Pungeršek
Jožef Stefan Institute; Faculty of Computer and Information Science, University of Ljubljana; Institute of Contemporary History
Peter Rupnik
Jožef Stefan Institute; Faculty of Computer and Information Science, University of Ljubljana; Institute of Contemporary History
Daniela Širinić
Faculty of Political Science, University of Zagreb
Nikola Ljubešić
Researcher at Jožef Stefan Institute
natural language processing, computational linguistics, computational social science