🤖 AI Summary
Existing political discourse corpora rarely offer high-quality, fine-grained annotations along multiple dimensions, which limits both NLP model training and empirical political science research. To address this, we introduce AgoraSpeech, a meticulously curated corpus of 171 political speeches from six parties in Greece's 2023 national elections, annotated per paragraph for six tasks: text classification, topic identification, sentiment analysis, named entity recognition, polarization detection, and populism detection. Annotation follows a two-step, human-in-the-loop process: ChatGPT generates the initial labels, which are then exhaustively validated by human annotators. The dataset was first used in a case study to provide insights during the pre-election period. More generally, it serves as a resource for political and social scientists, journalists, and data scientists, and as a benchmark and fine-tuning set for NLP models and LLMs.
📝 Abstract
Political discourse datasets are important for gaining political insights and for analyzing communication strategies and social science phenomena. Although numerous political discourse corpora exist, comprehensive, high-quality, annotated datasets are scarce. This is largely due to the substantial manual effort, multidisciplinarity, and expertise required for the nuanced annotation of rhetorical strategies and ideological contexts. In this paper, we present AgoraSpeech, a meticulously curated, high-quality dataset of 171 political speeches from six parties during the Greek national elections in 2023. The dataset includes per-paragraph annotations for six natural language processing (NLP) tasks: text classification, topic identification, sentiment analysis, named entity recognition, polarization detection, and populism detection. A two-step annotation process was employed, starting with ChatGPT-generated annotations and followed by exhaustive human-in-the-loop validation. The dataset was initially used in a case study to provide insights during the pre-election period. It nevertheless has general applicability: it is a rich source of information for political and social scientists, journalists, and data scientists, and it can be used for benchmarking and fine-tuning NLP models and large language models (LLMs).