🤖 AI Summary
This study addresses the problem of Greek government decision data being closed, machine-unreadable, and difficult to analyze. We construct the first open-source, large-scale, machine-readable corpus of Greek governmental decisions—drawn from the Diavgeia platform and comprising over one million documents—and propose a reproducible pipeline for precise PDF text extraction and structured annotation. Methodologically, we design a retrieval-augmented generation (RAG)-based question-answering framework tailored to governmental contexts, integrated with a high-quality QA evaluation framework combining structured information retrieval, logical reasoning, and human verification. Our contributions are threefold: (1) releasing the first high-quality Greek-language governmental corpus, filling a critical data gap for low-resource languages in law and public administration; (2) empirically validating RAG’s effectiveness for complex policy-related QA, enabling trustworthy and interpretable governmental AI assistants; and (3) providing foundational data and methodological paradigms for pretraining and domain adaptation of Greek legal large language models.
📝 Abstract
We introduce an open, machine-readable corpus of Greek government decisions sourced from the national transparency platform Diavgeia. The resource comprises over one million decisions with high-quality raw text extracted from PDFs, released in Markdown format alongside a fully reproducible extraction pipeline. Beyond the core dataset, we conduct qualitative analyses to explore boilerplate patterns and design a retrieval-augmented generation (RAG) task by formulating a set of representative questions, creating high-quality answers, and evaluating a baseline RAG system on its ability to retrieve and reason over public decisions. This evaluation demonstrates the potential of large-scale public-sector corpora to support advanced information access and transparency through structured retrieval and reasoning over governmental documents, and highlights how such a RAG pipeline could power a chat-based assistant capable of interactively answering questions about public decisions. Due to its scale, quality, and domain coverage, the corpus can also serve as high-value pre-training or fine-tuning material for new Language Models (LMs) and Large Language Models (LLMs), including specialized models for legal and governmental domains, and as a foundation for novel approaches in domain adaptation, knowledge-grounded generation, and explainable AI. Finally, we discuss limitations, outline future directions, and make both the data and the code accessible.