🤖 AI Summary
This paper addresses three core challenges in entity linking (EL) for humanities texts, particularly historical Italian documents: complex document typologies, a lack of domain-specific datasets and models, and long-tail entities that are under-represented in knowledge bases. To tackle these, we propose DELICATE, a neuro-symbolic EL method that combines a BERT-based encoder with contextual information from Wikidata and selects KB entities using temporal plausibility and entity type consistency. We also introduce ENEIDE, a multi-domain EL corpus of historical Italian, semi-automatically extracted from two annotated editions of literary and political texts spanning from the 19th to the 20th century. Experiments show that DELICATE outperforms other EL models on historical Italian, including larger architectures with billions of parameters, and that its confidence scores and feature sensitivity make its results more explainable and interpretable than those of purely neural methods.
📝 Abstract
Despite remarkable advances in Natural Language Processing, Entity Linking (EL) remains challenging in the humanities due to complex document typologies, a lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). This paper addresses these issues with two main contributions. The first is DELICATE, a novel neuro-symbolic method for EL on historical Italian, which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second is ENEIDE, a multi-domain EL corpus in historical Italian, semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show that DELICATE outperforms other EL models on historical Italian, even when compared with larger architectures with billions of parameters. Moreover, further analyses reveal that DELICATE's confidence scores and feature sensitivity provide results that are more explainable and interpretable than those of purely neural methods.
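To make the neuro-symbolic idea more concrete, the following is a minimal Python sketch of how symbolic constraints could be layered on top of neural candidate scores. It is not the authors' implementation: the `Candidate` fields, the `link` function, the 0.5 threshold, the placeholder identifiers, and the example scores are all assumptions made purely for illustration. The sketch discards candidates whose type disagrees with the mention or that did not yet exist when the document was written, then keeps the best-scoring survivor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """A KB candidate for a mention. All fields are illustrative assumptions."""
    qid: str                   # placeholder identifier, not a real Wikidata QID
    label: str
    entity_type: str           # coarse type such as "PER", "LOC", "ORG"
    start_year: Optional[int]  # birth or inception year; None if unknown
    neural_score: float        # similarity score from a hypothetical encoder


def is_temporally_plausible(cand: Candidate, doc_year: int) -> bool:
    # An entity that came into existence after the document was written
    # cannot be mentioned in it. Unknown dates are not penalised.
    return cand.start_year is None or cand.start_year <= doc_year


def link(candidates: List[Candidate], mention_type: str, doc_year: int,
         threshold: float = 0.5) -> Optional[Candidate]:
    """Return the best candidate that passes both symbolic checks, or None."""
    plausible = [
        c for c in candidates
        if c.entity_type == mention_type and is_temporally_plausible(c, doc_year)
    ]
    if not plausible:
        return None  # no admissible KB entry: treat the mention as unlinkable
    best = max(plausible, key=lambda c: c.neural_score)
    return best if best.neural_score >= threshold else None


# Hypothetical candidates for the mention "Roma"; scores are made up.
candidates = [
    Candidate("Q_ROME_CITY", "Rome", "LOC", -753, 0.71),
    Candidate("Q_ROMA_CLUB", "A.S. Roma", "ORG", 1927, 0.83),
]

result = link(candidates, mention_type="LOC", doc_year=1850)
print(result.label if result else "NIL")
```

In this toy case the football club has the higher neural score, but it fails both checks for a location mention in an 1850 text, so the city is selected. Hard filtering is only one possible design; the constraints could equally be used as soft features that adjust the score.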