ATLAS: Article Tracking, Linking, and Analysis of Swedish Encyclopedias

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses three problems in digitized historical encyclopedias: loss of structural information, difficulty tracing how knowledge evolves across editions, and weak linkage between entries in different editions. It proposes an automated framework for structured reconstruction and cross-edition knowledge tracking, applied to the four major editions of *Nordisk familjebok*, an authoritative Swedish encyclopedia published between 1876 and 1951. The approach integrates OCR post-processing, headword extraction, entity classification, cross-edition entry alignment, and Wikidata linking to recover entry structure and connect entries semantically. It attains an F1 score of 97.8% for headword extraction and 93.4% for headword classification. On a small-scale evaluation, it reaches 93% precision in cross-edition matching and 85% precision (with 16.5% recall) in Wikidata linking. These results support systematic analysis of knowledge evolution across the editions of this reference work.
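To make the Wikidata linking figures easier to read, the sketch below combines the reported precision and recall into an F1 score using the standard harmonic-mean formula. The combined value is derived here for illustration only; the source reports precision and recall for this step, not an F1 score, and the function name `f1_score` is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported for Wikidata linking on the small-scale evaluation.
wikidata_precision = 0.85
wikidata_recall = 0.165

# Derived for illustration; not a number stated in the source.
wikidata_f1 = f1_score(wikidata_precision, wikidata_recall)
print(f"Derived F1 for Wikidata linking: {wikidata_f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the low recall dominates: the links that are proposed are mostly correct, but most entries receive no link at all.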

📝 Abstract

The digitization of old encyclopedias represents an important step to improve access to historically structured knowledge. Often, however, this process does not go beyond an optical character recognition, leaving all the underlying structure unexploited. In addition, many encyclopedias had multiple editions reflecting the evolution of knowledge. The lack of structure in the raw text makes it difficult to track changes across these editions. In this work, we built a pipeline to restore the text structure, where we extract the headwords and identify entries; categorize the entities; match entries across editions; and link entries to a Wikidata item. We applied this pipeline to the four major editions of *Nordisk familjebok*, an authoritative Swedish encyclopedia published between 1876 and 1951. We could extract the headwords with an F1 score of 97.8% and we obtained an F1 score of 93.4% on the headword classification. On a small-scale evaluation, we reached a 93% precision on the cross-edition matching, 85% precision and 16.5% recall on the Wikidata linking. This shows that an automated approach to digitized historical knowledge is possible. This should facilitate the preservation of general knowledge and the understanding of knowledge transmission. The datasets and programs are available online.
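The cross-edition matching step can be pictured with a minimal sketch that pairs headwords from two editions by normalized string similarity. This is not the authors' method, which the abstract does not describe. The function names, the 0.9 threshold, and the sample headwords are all assumptions made up for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(headword: str) -> str:
    """Case-fold and drop punctuation and spaces; letters such as å, ä, ö are kept."""
    text = unicodedata.normalize("NFC", headword).casefold()
    return "".join(ch for ch in text if ch.isalnum())


def match_entries(edition_a, edition_b, threshold=0.9):
    """Map each headword in edition_a to its most similar headword in edition_b.

    Exact matches on the normalized form win outright; otherwise the best
    candidate is accepted only if its similarity ratio reaches `threshold`.
    """
    index_b = {normalize(head): head for head in edition_b}
    matches = {}
    for head in edition_a:
        key = normalize(head)
        if key in index_b:
            matches[head] = index_b[key]
            continue
        best, best_score = None, threshold
        for key_b, head_b in index_b.items():
            score = SequenceMatcher(None, key, key_b).ratio()
            if score >= best_score:
                best, best_score = head_b, score
        if best is not None:
            matches[head] = best
    return matches


# Invented sample headwords, not taken from the published datasets.
older_edition = ["Upsala", "Göteborg", "Telefon"]
newer_edition = ["Uppsala", "Göteborg.", "Television"]

matches = match_entries(older_edition, newer_edition)
print(matches)
```

In this toy run a spelling variant and a trailing period still match, while "Telefon" is left unmatched rather than being paired with "Television". A strict threshold like this favors precision over recall. A real system would also need to handle homonymous headwords and entries that are split or merged between editions, which headword similarity alone cannot resolve.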

Problem

Research questions and friction points this paper is trying to address.

encyclopedia digitization

text structure restoration

cross-edition tracking

knowledge evolution

historical knowledge

Innovation

Methods, ideas, or system contributions that make the work stand out.

encyclopedia digitization

cross-edition matching

headword extraction