ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian

📅 2026-03-31
🤖 AI Summary
This work addresses the lack of multi-domain named entity recognition and linking (NERL) datasets for historical Italian by introducing ENEIDE, which the authors describe as the first publicly available multi-domain NERL dataset for this language variety with training, development, and test splits. ENEIDE is a silver-standard corpus of 2,111 documents drawn from two scholarly digital editions, together spanning two centuries, annotated with over 8,000 entities. Annotations were extracted semi-automatically from the manually curated editions, linked to Wikidata identifiers, and passed through quality control and annotation enhancement; entities that cannot be mapped to the knowledge graph are marked NIL. Baseline experiments indicate that ENEIDE is challenging for current NERL models and show a gap between zero-shot and fine-tuned approaches. The diachronic coverage also makes the dataset suitable for temporal entity disambiguation and cross-domain evaluation.
📝 Abstract
This paper introduces ENEIDE (Extracting Named Entities from Italian Digital Editions), a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts. The corpus comprises 2,111 documents with over 8,000 entity annotations semi-automatically extracted from two scholarly digital editions: Digital Zibaldone, the philosophical diary of the Italian poet Giacomo Leopardi (1798–1837), and Aldo Moro Digitale, the complete works of the Italian politician Aldo Moro (1916–1978). Annotations cover multiple entity types (person, location, organization, literary work) linked to Wikidata identifiers, including NIL entities that cannot be mapped to the knowledge graph. To the best of our knowledge, ENEIDE represents the first multi-domain, publicly available NERL dataset for historical Italian with training, development, and test splits. We present a methodology for semi-automatic extraction of annotations from manually curated scholarly digital editions, including quality control and annotation enhancement procedures. Baseline experiments using state-of-the-art models show that the dataset is challenging for NERL and reveal a gap between zero-shot approaches and fine-tuned models. The dataset's diachronic coverage, spanning two centuries, makes it particularly suitable for temporal entity disambiguation and cross-domain evaluation. ENEIDE is released under a CC BY-NC-SA 4.0 license.
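To make the role of NIL entities concrete, the following is a minimal sketch of how typed, Wikidata-linked mentions could be represented and scored. Everything in it is an assumption for illustration: the `Mention` record, the tag names, the `NIL` marker string, the placeholder identifiers, and the strict-match scoring are hypothetical and are not taken from the ENEIDE release or its evaluation protocol.

```python
from dataclasses import dataclass

# Assumed marker for mentions that have no entry in the knowledge graph.
NIL = "NIL"


@dataclass(frozen=True)
class Mention:
    """Hypothetical record for one annotated entity mention."""
    start: int        # character offset where the mention begins
    end: int          # character offset where the mention ends (exclusive)
    etype: str        # assumed tag names, e.g. "PER", "LOC", "ORG", "WORK"
    wikidata_id: str  # a Wikidata identifier, or NIL when unlinkable


def linking_accuracy(gold, pred):
    """Fraction of gold mentions whose span, type, and link are all predicted.

    NIL is treated as an ordinary label: predicting a Wikidata identifier
    for a gold NIL mention counts as an error, and vice versa.
    """
    if not gold:
        return 0.0
    predicted = set(pred)
    hits = sum(1 for mention in gold if mention in predicted)
    return hits / len(gold)


# Placeholder identifiers, not real Wikidata QIDs.
gold = [
    Mention(0, 8, "PER", "Q_PLACEHOLDER_1"),
    Mention(20, 26, "LOC", NIL),
]
pred = [
    Mention(0, 8, "PER", "Q_PLACEHOLDER_1"),
    Mention(20, 26, "LOC", "Q_PLACEHOLDER_2"),  # wrongly linked; gold is NIL
]

score = linking_accuracy(gold, pred)
```

Under this strict-match assumption, a system that links every mention to some knowledge-graph entry is penalized on NIL mentions, which is one reason a dataset with explicit NIL annotations can be harder than one without them.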
Problem

Research questions and friction points this paper is trying to address.

Named Entity Recognition
Named Entity Linking
Historical Italian
Silver Standard Dataset
Entity Disambiguation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Named Entity Recognition and Linking
Historical Italian
Silver Standard Dataset
Semi-automatic Annotation
Temporal Entity Disambiguation