🤖 AI Summary
This paper addresses longitudinal identity resolution in Icelandic historical census data spanning roughly 220 years (1703–1920), where name variation, generational turnover, and complex kinship structures make it hard to identify the same individual consistently over time. To this end, we introduce ICE-ID, which is, to the best of our knowledge, the first large-scale, open tabular benchmark for long-term identity resolution in a real-world historical population. We define standardized matching tasks within and across census waves, together with documented metrics and splits. Methodologically, we apply the Non-Axiomatic Reasoning System (NARS), whose core is Non-Axiomatic Logic (NAL), to tabular historical record linkage, and compare it against handcrafted rule-based matchers, an ML ensemble, and LLM-style models for structured data (e.g., TabTransformer). Experiments show that NARS is competitive with these baselines and achieves state-of-the-art results on the task. The ICE-ID dataset and source code are publicly released, providing a reproducible benchmark for historical demography, digital humanities, and temporal entity resolution.
📝 Abstract
We introduce ICE-ID, a novel benchmark dataset for historical identity resolution, comprising 220 years (1703–1920) of Icelandic census records. ICE-ID spans multiple generations of longitudinal data, capturing name variations, demographic changes, and rich genealogical links. To the best of our knowledge, this is the first large-scale, open tabular dataset specifically designed to study long-term person-entity matching in a real-world population. We define identity resolution tasks (within and across census waves) with clearly documented metrics and splits. We evaluate a range of methods, including handcrafted rule-based matchers, an ML ensemble, and LLMs for structured data (e.g., transformer-based tabular networks), against NARS (Non-Axiomatic Reasoning System), a general-purpose AI framework designed to reason with limited knowledge and resources, which we apply to tabular data as a novel approach. The core of NARS is Non-Axiomatic Logic (NAL), a term-based logic. Our experiments show that NARS is surprisingly simple and competitive with other standard approaches, achieving state-of-the-art (SOTA) results on our task. By releasing ICE-ID and our code, we enable reproducible benchmarking of identity resolution approaches in longitudinal settings and hope that ICE-ID opens new avenues for cross-disciplinary research in data linkage and historical analytics.
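For readers unfamiliar with NAL, the following minimal sketch illustrates how a NAL-style truth value can be derived from evidence counts: frequency f = w+/w and confidence c = w/(w + k), where k is the evidential horizon, and revision pools evidence from independent sources. It is not the pipeline used in the paper. The record fields, the sample values, the `field_evidence` helper, and the rule that each compared field counts as one unit of evidence are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

K = 1.0  # evidential horizon; 1 is the customary NAL default (assumed here)


@dataclass(frozen=True)
class Truth:
    """NAL-style truth value stored as evidence counts."""
    w_plus: float  # amount of positive evidence
    w: float       # total amount of evidence

    @property
    def frequency(self) -> float:
        # f = w+ / w; with no evidence f is undefined, 0.5 is this sketch's placeholder
        return self.w_plus / self.w if self.w > 0 else 0.5

    @property
    def confidence(self) -> float:
        # c = w / (w + k); approaches 1 as evidence accumulates, never reaches it
        return self.w / (self.w + K)

    def revise(self, other: "Truth") -> "Truth":
        # Revision: pool evidence from two independent sources
        return Truth(self.w_plus + other.w_plus, self.w + other.w)


def field_evidence(a: dict, b: dict, fields: list) -> Truth:
    """Hypothetical rule: each field present in both records is one unit of evidence."""
    truth = Truth(0.0, 0.0)
    for name in fields:
        if a.get(name) is None or b.get(name) is None:
            continue  # a missing value contributes no evidence either way
        agrees = 1.0 if a[name] == b[name] else 0.0
        truth = truth.revise(Truth(agrees, 1.0))
    return truth


# Made-up records standing in for two census entries (not taken from ICE-ID)
rec_a = {"first_name": "Jón", "patronym": "Jónsson", "birth_year": 1690, "parish": None}
rec_b = {"first_name": "Jón", "patronym": "Jónsson", "birth_year": 1691, "parish": "Hólar"}

t = field_evidence(rec_a, rec_b, ["first_name", "patronym", "birth_year", "parish"])
print(f"frequency={t.frequency:.3f} confidence={t.confidence:.3f}")
```

In this toy case two of the three comparable fields agree, so the statement "these records refer to the same person" gets a frequency of about 0.667 and a confidence of 0.75. The missing parish value lowers the confidence but does not count against the match, which is the kind of behavior that matters when reasoning with limited knowledge.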