Chronicling Germany: An Annotated Historical Newspaper Dataset

📅 2024-01-30
🤖 AI Summary
Historical German newspapers exhibit dense layouts and degraded typography, which severely impair layout analysis and OCR accuracy and in turn hinder NLP applications in the digital humanities. To address this, we introduce the first fine-grained, jointly annotated dataset for German historical newspapers (1852–1924), comprising 693 pages annotated across five structural levels: columns, articles, paragraphs, lines, and characters. Crucially, it is the first to provide synchronized layout and text annotations. Leveraging this resource, we establish a cross-domain generalization benchmark to advance low-resource historical document visual understanding. We validate its utility on standard OCR and layout analysis pipelines, achieving high in-domain accuracy. All data and code are publicly released, supporting reproducibility and scalability in historical document digitization.

📝 Abstract
The correct detection of dense article layouts and the recognition of characters in historical newspaper pages remain challenging requirements for Natural Language Processing (NLP) and machine learning applications on historical newspapers in the field of digital history. Digital newspaper portals for historic Germany typically provide Optical Character Recognition (OCR) text, albeit of varying quality. Unfortunately, layout information is often missing, which limits the scope of this rich source. Our dataset is designed to enable the training of layout and OCR models for historic German-language newspapers. The Chronicling Germany dataset contains 693 annotated historical newspaper pages from the period between 1852 and 1924. The paper presents a processing pipeline and establishes baseline results on in- and out-of-domain test data using this pipeline. Both our dataset and the corresponding baseline code are freely available online. This work creates a starting point for future research in the field of digital history and historic German-language newspaper processing. Furthermore, it provides the opportunity to study a low-resource task in computer vision.
Problem

Research questions and friction points this paper is trying to address.

Detecting dense article layout in historical newspapers
Improving OCR accuracy for historic German-language texts
Providing annotated dataset for digital history research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Annotated dataset for layout and OCR training
Processing pipeline for historical newspaper analysis
Baseline results on in- and out-of-domain data
Authors

Christian Schultze
High-Performance Computing and Analytics (HPCA-Lab), Universität Bonn
Niklas Kerkfeld
HPCA-Lab, Universität Bonn
Kara Kuebart
Institut für Geschichtswissenschaft, Universität Bonn
Princilia Weber
Institut für Geschichtswissenschaft, Universität Bonn
Moritz Wolter
HPCA-Lab, Universität Bonn
Felix Selgert
Institut für Geschichtswissenschaft, Universität Bonn