A Greek Government Decisions Dataset for Public-Sector Analysis and Insight

📅 2025-12-05
🤖 AI Summary
This study addresses the problem of Greek government decision data being closed, machine-unreadable, and difficult to analyze. We construct the first open-source, large-scale, machine-readable corpus of Greek governmental decisions, drawn from the Diavgeia platform and comprising over one million documents, and propose a reproducible pipeline for precise PDF text extraction and structured annotation. Methodologically, we design a retrieval-augmented generation (RAG)-based question-answering framework tailored to governmental contexts, integrated with a high-quality QA evaluation framework combining structured information retrieval, logical reasoning, and human verification. Our contributions are threefold: (1) releasing the first high-quality Greek-language governmental corpus, filling a critical data gap for low-resource languages in law and public administration; (2) empirically validating RAG's effectiveness for complex policy-related QA, enabling trustworthy and interpretable governmental AI assistants; and (3) providing foundational data and methodological paradigms for pretraining and domain adaptation of Greek legal large language models.

📝 Abstract
We introduce an open, machine-readable corpus of Greek government decisions sourced from the national transparency platform Diavgeia. The resource comprises 1 million decisions with high-quality raw text extracted from PDFs. The text is released in Markdown format, alongside a fully reproducible extraction pipeline. Beyond the core dataset, we conduct qualitative analyses to explore boilerplate patterns and design a retrieval-augmented generation (RAG) task: we formulate a set of representative questions, create high-quality answers, and evaluate a baseline RAG system on its ability to retrieve and reason over public decisions. This evaluation demonstrates the potential of large-scale public-sector corpora to support advanced information access and transparency through structured retrieval and reasoning over governmental documents, and highlights how such a RAG pipeline could simulate a chat-based assistant capable of interactively answering questions about public decisions. Owing to its scale, quality, and domain coverage, the corpus can also serve as high-value pre-training material for new Language Models (LMs) and fine-tuning material for Large Language Models (LLMs), including specialized models for legal and governmental domains, and as a foundation for novel approaches in domain adaptation, knowledge-grounded generation, and explainable AI. Finally, we discuss limitations, outline future directions, and make both the data and the code accessible.
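One simple way to start exploring boilerplate patterns in such a corpus is to count how many documents each line of text appears in. The sketch below is only an illustration under stated assumptions: the function name `boilerplate_lines`, the `min_share` threshold, and the three toy documents are hypothetical and are not taken from the paper or its released code.

```python
from collections import Counter


def boilerplate_lines(docs, min_share=0.5):
    """Return lines that occur in at least `min_share` of the documents.

    Hypothetical helper for illustration; not the authors' analysis method.
    """
    doc_freq = Counter()
    for text in docs:
        # Count each distinct, non-empty line once per document.
        lines = {line.strip() for line in text.splitlines() if line.strip()}
        doc_freq.update(lines)
    threshold = min_share * len(docs)
    return sorted(line for line, n in doc_freq.items() if n >= threshold)


# Made-up toy documents standing in for extracted decision text.
docs = [
    "ΕΛΛΗΝΙΚΗ ΔΗΜΟΚΡΑΤΙΑ\nΑΠΟΦΑΣΗ\nΈγκριση δαπάνης για προμήθεια υλικών",
    "ΕΛΛΗΝΙΚΗ ΔΗΜΟΚΡΑΤΙΑ\nΑΠΟΦΑΣΗ\nΟρισμός επιτροπής παραλαβής",
    "ΕΛΛΗΝΙΚΗ ΔΗΜΟΚΡΑΤΙΑ\nΑΝΑΚΟΙΝΩΣΗ\nΠρόσληψη προσωπικού",
]

common = boilerplate_lines(docs, min_share=0.6)
print(common)
```

Exact line matching is deliberately crude. On real extracted text, near-duplicate lines (differing dates, protocol numbers, or whitespace) would likely need normalization before counting.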
Problem

Research questions and friction points this paper is trying to address.

Creating an open dataset of Greek government decisions for analysis
Developing a RAG system to retrieve and reason over public decisions
Providing a resource for training specialized language models in legal domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open dataset of Greek government decisions extracted from PDFs
Retrieval-augmented generation pipeline for question-answering on decisions
Corpus for pre-training and fine-tuning legal domain language models
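
The retrieval half of a RAG pipeline over decisions can be sketched with nothing but the standard library. This is a minimal illustration, assuming a plain TF-IDF lexical retriever and a hand-written prompt template; the listing does not say which retriever or generator the paper's baseline uses, and every name here (`build_index`, `retrieve`, `build_prompt`) and the toy decisions are hypothetical. The generation step, which would pass the prompt to a language model, is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # \w+ is Unicode-aware in Python 3, so Greek text tokenizes as well.
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Compute smoothed IDF weights and one TF-IDF vector per document."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in c.items()} for c in counts_per_doc]
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, docs, idf, vectors, k=2):
    """Return the k documents most similar to the question."""
    q_counts = Counter(tokenize(question))
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(q_vec, vectors[i]),
                    reverse=True)
    return [docs[i] for i in ranked[:k]]


def build_prompt(question, passages):
    """Assemble a grounded prompt; the model call itself is out of scope."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return ("Answer using only the decisions below.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")


# Made-up toy decisions (in English for readability).
docs = [
    "Decision approving a budget of 5000 euros for road maintenance in the municipality.",
    "Decision appointing a three-member committee for procurement of office supplies.",
    "Decision on hiring seasonal staff for the municipal library.",
]
idf, vectors = build_index(docs)
question = "Which decision concerns road maintenance?"
top = retrieve(question, docs, idf, vectors, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Numbering the retrieved passages in the prompt lets a generated answer cite the decision it relies on, which helps when checking answers against the source documents.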