The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Citizen deliberation texts are often noisy and thematically heterogeneous, which makes them unsuitable for direct use in topic modeling and political analysis. This work introduces and formally defines the task of “corpus clarification,” which transforms raw citizen inputs into structured, self-contained argument units. To support this task, the authors construct the human-annotated GDN-CC dataset, comprising 1,231 contributions and 2,285 argument units, along with a large-scale automatically annotated variant, GDN-CC-large, containing 240,000 contributions. Experimental results show that fine-tuned small open-source language models match or outperform large language models at clarification, and the clarified output is also evaluated on a downstream opinion clustering task. The approach supports transparent and efficient analysis of political discourse with models that can be run locally.

📝 Abstract
LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised about their use as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the pragmatic level and make them easier to use in topic modeling and political analysis; (b) to study how reliably this standardization can be performed by small, open-weights LLMs, i.e., models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present GDN-CC, a manually curated dataset of 1,231 contributions to the French Grand Débat National, comprising 2,285 argumentative units annotated for argumentative structure and manually clarified. We then show that fine-tuned Small Language Models match or outperform LLMs on reproducing these annotations, and measure their usability for an opinion clustering task. We finally release GDN-CC-large, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.
Problem

Research questions and friction points this paper is trying to address.

Corpus Clarification
Democratic Citizen Consultations
Argumentative Units
Text Standardization
Public Opinion Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Corpus Clarification
Small Language Models
Argumentative Unit Extraction
Democratic Consultation
GDN-CC Dataset
Pierre-Antoine Lequeu
Sorbonne Université, CNRS, ISIR, Paris, France
Léo Labat
Sorbonne Université, CNRS, ISIR, Paris, France; Institut Polytechnique de Paris, CNRS, CREST, Paris, France
Laurene Cave
Sorbonne Université, STIH/CERES, Paris, France
Gaël Lejeune
Sorbonne Université, STIH/CERES, Paris, France
François Yvon
ISIR / CNRS and Sorbonne Université
Natural Language Processing, Speech Processing, Computational Linguistics, Machine Translation
Benjamin Piwowarski
CNRS, ISIR, Sorbonne Université
Information Retrieval, Machine Learning, Computational Linguistics