A pipeline for matching bibliographic references with incomplete metadata: experiments with Crossref and OpenCitations

📅 2025-11-23
🤖 AI Summary
More than 698 million references in Crossref lack DOIs, which hinders the creation of formal citation links and the construction of citation networks. This paper proposes a systematic approach that combines heuristic rules, metadata matching, and fuzzy parsing of reference text to align unstructured references with target bibliographic entities in OpenCitations Meta. Methodologically, it pairs an interpretable rule engine with matching strategies designed to improve linkage accuracy when metadata are sparse or inconsistent. Contributions include: (1) a hybrid alignment framework that balances precision and robustness; (2) a manually curated gold standard and a Crossref subset used for evaluation. Experimental results show strong matching precision, while recall remains limited by metadata inconsistencies. The method offers a reproducible pathway for enriching references with DOIs at scale and for extending the coverage of the open citation network.

📝 Abstract
While Crossref makes available more than 1.8 billion bibliographic references from publications for which it provides a DOI, more than 698 million of these references do not specify a DOI, which makes it difficult to create a formal citation link between the citing entity and the cited entity. In this article, we propose an analysis of Crossref bibliographic references to show how we can use the unstructured text defining such references, together with the available (and partial) metadata specified in them, to (a) map them to existing entities included in OpenCitations Meta and then (b) enable the potential inclusion of additional, valid citation links among these entities. We have defined a precise methodology to carry out the analysis and have run it against a manually defined Gold Standard and a subset of Crossref. While the heuristic-based tool we developed has demonstrated strong matching precision and effective metadata integration, its recall limitations highlight the need for further enhancements to address metadata inconsistencies and to leverage additional sources of citation data.
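The abstract reports precision and recall against a Gold Standard without giving the formulas. As a minimal sketch of how such scores are commonly computed for a matching task (the function name, the dictionary layout, and the sample identifiers are assumptions made for illustration, not the authors' evaluation code):

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for reference-to-entity matches.

    Both arguments map a reference id to the id of the matched entity.
    A value of None (or a missing key) means "no match".
    This layout is a hypothetical one chosen for the example.
    """
    # Only matches the tool actually proposed count toward precision.
    proposed = {ref: ent for ref, ent in predicted.items() if ent is not None}
    # A true positive is a proposed match that agrees with the gold standard.
    true_pos = sum(1 for ref, ent in proposed.items() if gold.get(ref) == ent)
    # Recall is measured over every reference that has a gold match.
    gold_pos = sum(1 for ent in gold.values() if ent is not None)

    precision = true_pos / len(proposed) if proposed else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall


# Toy data: the tool proposes two matches, both correct, and misses two.
gold = {"r1": "e1", "r2": "e2", "r3": "e3", "r4": "e4"}
predicted = {"r1": "e1", "r2": "e2", "r3": None, "r4": None}
precision, recall = precision_recall(predicted, gold)
```

In this toy case every proposed match is correct but half of the gold matches are missed, which is the kind of high-precision, lower-recall profile the abstract describes.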
Problem

Research questions and friction points this paper is trying to address.

Matching bibliographic references with incomplete metadata to existing entities
Creating formal citation links when references lack specified DOIs
Addressing metadata inconsistencies in Crossref references using heuristic tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipeline matches references with incomplete metadata
Heuristic tool integrates metadata for citation links
Methodology tested against Crossref and OpenCitations datasets
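The idea of matching a reference that carries only partial metadata can be sketched as a weighted score over whichever fields happen to be present. The example below is an illustrative sketch under stated assumptions, not the pipeline described in the paper: the field names (`title`, `year`, `author`), the weights, and the 0.85 threshold are all hypothetical, and the fuzzy comparison uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever text matching the authors use.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def score(reference, candidate):
    """Score a candidate using only the fields the reference provides.

    Weights are hypothetical; missing fields are skipped and the score
    is renormalized over the fields that were actually compared.
    """
    total, weight_sum = 0.0, 0.0
    if reference.get("title") and candidate.get("title"):
        # Fuzzy title similarity in [0, 1].
        sim = SequenceMatcher(
            None, normalize(reference["title"]), normalize(candidate["title"])
        ).ratio()
        total += 0.6 * sim
        weight_sum += 0.6
    if reference.get("year") and candidate.get("year"):
        # Exact year agreement, compared as strings.
        total += 0.2 * float(str(reference["year"]) == str(candidate["year"]))
        weight_sum += 0.2
    if reference.get("author") and candidate.get("author"):
        # Exact match on the normalized first-author name.
        same = normalize(reference["author"]) == normalize(candidate["author"])
        total += 0.2 * float(same)
        weight_sum += 0.2
    return total / weight_sum if weight_sum else 0.0


def best_match(reference, candidates, threshold=0.85):
    """Return the top-scoring candidate, or None if it is below threshold."""
    scored = [(score(reference, c), c) for c in candidates]
    if not scored:
        return None
    top_score, top = max(scored, key=lambda pair: pair[0])
    return top if top_score >= threshold else None


# Toy data: the reference has a title and a year but no author.
reference = {"title": "Open citations: definition", "year": "2020"}
candidates = [
    {"id": "A", "title": "Open Citations: Definition.", "year": 2020, "author": "Peroni"},
    {"id": "B", "title": "Deep learning for image segmentation", "year": 2019},
]
match = best_match(reference, candidates)
```

A high threshold favours precision over recall: a reference with too little usable metadata yields no match rather than a doubtful one.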
Matteo Guenci
Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
Ivan Heibi
University of Bologna
Semantic Publishing, Semantic Web, Data Visualisation, Web technologies
Chiara Parravicini
Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
Silvio Peroni
University of Bologna
Semantic Publishing, Semantic Web, Open Science, Science of Science, Scholarly Communication
Marta Soricetti
Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy