EDDA-Coordinata: An Annotated Dataset of Historical Geographic Coordinates

📅 2026-02-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of automatically extracting geographic coordinates from 18th-century historical documents, where expressions vary widely and precision is inconsistent. The authors construct the first high-quality annotated dataset of geographic coordinates in early modern French texts, comprising 4,798 entries from Diderot and d’Alembert’s Encyclopédie. They propose a two-stage model: first, a classifier identifies entries containing coordinates; second, a Transformer-based encoder–decoder and decoder-only architecture extracts and standardizes the coordinates. Evaluated via cross-validation on the Encyclopédie, the method achieves an exact match (EM) score of 86%. It further demonstrates strong cross-lingual and cross-temporal generalization, attaining EM scores of 61% on the 18th-century French Dictionnaire de Trévoux and 77% on the 19th-century English Encyclopædia Britannica.

Technology Category

Application Category

📝 Abstract
This paper introduces a dataset of enriched geographic coordinates retrieved from Diderot and d'Alembert's eighteenth-century Encyclopedie. Automatically recovering geographic coordinates from historical texts is a complex task, as they are expressed in a variety of ways and with varying levels of precision. To improve retrieval of coordinates from similar digitized early modern texts, we have created a gold standard dataset, trained models, published the resulting inferred and normalized coordinate data, and experimented applying these models to new texts. From 74,000 total articles in each of the digitized versions of the Encyclopedie from ARTFL and ENCCRE, we examined 15,278 geographical entries, manually identifying 4,798 containing coordinates, and 10,480 with descriptive but non-numerical references. Leveraging our gold standard annotations, we trained transformer-based models to retrieve and normalize coordinates. The pipeline presented here combines a classifier to identify coordinate-bearing entries and a second model for retrieval, tested across encoder-decoder and decoder architectures. Cross-validation yielded an 86% EM score. On an out-of-domain eighteenth-century Trevoux dictionary (also in French), our fine-tuned model had a 61% EM score, while for the nineteenth-century, 7th edition of the Encyclopaedia Britannica in English, the EM was 77%. These findings highlight the gold standard dataset's usefulness as training data, and our two-step method's cross-lingual, cross-domain generalizability.
Problem

Research questions and friction points this paper is trying to address.

historical geographic coordinates
named entity recognition
geoparsing
gold standard dataset
early modern texts
Innovation

Methods, ideas, or system contributions that make the work stand out.

historical geographic coordinates
gold standard dataset
transformer-based models
coordinate normalization
cross-lingual generalization
🔎 Similar Papers
No similar papers found.