🤖 AI Summary
Citizen deliberation texts are often noisy and thematically heterogeneous, making them unsuitable for direct use in issue modeling and political analysis. This work introduces and formally defines the novel task of "corpus clarification," which transforms raw citizen inputs into structured, self-contained argument units. To support this task, we construct the human-annotated GDN-CC dataset, comprising 1,231 contributions and 2,285 argument units, along with a large-scale automatically annotated variant, GDN-CC-large, containing 240,000 instances. Experimental results demonstrate that fine-tuned small open-source language models achieve clarification performance on par with or superior to that of large language models, while also effectively enabling downstream opinion clustering. Our approach establishes a new paradigm for transparent and efficient analysis of political discourse.
📝 Abstract
LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised about their use as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the pragmatic level, making them easier to use in topic modeling and political analysis; (b) to study how reliably this standardization can be performed by small, open-weights LLMs, i.e., models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present GDN-CC, a manually curated dataset of 1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units annotated for argumentative structure and manually clarified. We then show that fine-tuned Small Language Models match or outperform LLMs at reproducing these annotations, and measure their usability for an opinion clustering task. Finally, we release GDN-CC-large, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.