🤖 AI Summary
This study addresses the lack of standardized formatting and section labeling in German court judgments within open legal data, which hinders natural language processing tasks such as rhetorical role classification, legal retrieval, and citation analysis. To overcome this limitation, the authors present the first large-scale structured corpus of German judicial decisions, comprising 251,038 rulings extracted and segmented from Open Legal Data. The corpus separates three core sections of each decision: Tenor (operative part), Tatbestand (facts of the case), and Entscheidungsgründe (judicial reasoning). The Rechtsmittelbelehrung (appeal notice) is stored as a fourth, separate field because it is a procedural instruction rather than part of the decision itself. Segmentation relies on rule-based methods and was checked by manual validation of a random sample of 384 cases, sized with Cochran’s formula (95% confidence level, ±5% margin of error). The resulting section-annotated dataset is publicly released in JSONL format as a foundational resource for German legal NLP research.
📝 Abstract
The availability of structured legal data is important for advancing Natural Language Processing (NLP) techniques for the German legal system. One of the most widely used datasets, Open Legal Data, provides a large-scale collection of German court decisions. While the metadata in this raw dataset is consistently structured, the decision texts themselves are inconsistently formatted and often lack clearly marked sections. Reliable separation of these sections is important not only for rhetorical role classification but also for downstream tasks such as retrieval and citation analysis. In this work, we introduce a cleaned and sectioned dataset of 251,038 German court decisions derived from the official Open Legal Data dataset. We systematically separated three important sections in German court decisions, namely Tenor (operative part of the decision), Tatbestand (facts of the case), and Entscheidungsgründe (judicial reasoning), which are often inconsistently represented in the original dataset. To ensure the reliability of our extraction process, we used Cochran's formula with a 95% confidence level and a 5% margin of error to draw a statistically representative random sample of 384 cases, and manually verified that all three sections were correctly identified. We also extracted the Rechtsmittelbelehrung (appeal notice) as a separate field, since it is a procedural instruction and not part of the decision itself. The resulting corpus is publicly available in JSONL format, making it an accessible resource for further research on the German legal system.
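The sample size of 384 can be reproduced with a short calculation. The sketch below is illustrative only: it assumes the conventional z-score of 1.96 for a 95% confidence level and the most conservative proportion p = 0.5, neither of which is stated in the abstract, and the function names are hypothetical. The finite population correction is included for comparison; the abstract does not say whether it was applied.

```python
def cochran_sample_size(z: float, p: float, e: float) -> float:
    """Cochran's formula for a large population: n0 = z^2 * p * (1 - p) / e^2."""
    return (z ** 2) * p * (1 - p) / (e ** 2)


def finite_population_correction(n0: float, population: int) -> float:
    """Adjust n0 for a finite population: n = n0 / (1 + (n0 - 1) / N)."""
    return n0 / (1 + (n0 - 1) / population)


# Assumed inputs: z = 1.96 (95% confidence), p = 0.5 (maximum variance),
# e = 0.05 (5% margin of error). The corpus size comes from the abstract.
n0 = cochran_sample_size(z=1.96, p=0.5, e=0.05)
n_corrected = finite_population_correction(n0, population=251_038)

print(f"uncorrected sample size: {n0:.2f}")
print(f"with finite population correction: {n_corrected:.2f}")
```

Under these assumptions the uncorrected value is about 384.16, which matches the reported 384 after rounding to the nearest integer. With a corpus of 251,038 decisions, the finite population correction lowers the value by less than one case, so it has little practical effect here.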