🤖 AI Summary
To address the lack of systematic quality assurance for bibliographic and citation data in the OpenCitations infrastructure, this paper designs and implements an interpretable validation and dynamic quality monitoring framework tailored to the OpenCitations Data Model (OCDM). Methodologically, it integrates a customizable rule engine, SPARQL-based consistency checking, semantic constraint validation, and an incremental quality dashboard, enabling error attribution analysis and quantitative assessment. Key contributions include: (1) the first interpretable validation tool specifically designed for OCDM; and (2) a novel dynamic, sustainable quality tracking mechanism. Experimental evaluation demonstrates that the framework accurately identifies structural and semantic defects in the Matilda dataset and detects, localizes, and quantifies persistent issues (including duplication, incompleteness, and inconsistency) in OpenCitations Meta. The approach significantly enhances data reliability and fills a critical gap in systematic quality assurance for open citation data.
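The rule-engine idea summarized above (named rules that fire on individual records and attribute each error to its cause) can be sketched roughly as follows. This is a minimal illustration, not the paper's actual implementation: all field names, rule names, and record shapes are hypothetical assumptions.

```python
# Hypothetical sketch of a customizable validation rule engine with error
# attribution, loosely modeled on the framework described above.
# All field names and rule names are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Issue:
    rule: str        # which rule fired (supports error attribution)
    record_id: str   # which record is affected
    message: str     # human-readable explanation of the defect

# A rule is a name, a predicate that returns True when the rule is
# violated, and an explanation template.
Rule = tuple[str, Callable[[dict], bool], str]

RULES: list[Rule] = [
    ("missing-title", lambda r: not r.get("title"),
     "record has no title (incompleteness)"),
    ("missing-identifier", lambda r: not r.get("ids"),
     "record has no external identifier (incompleteness)"),
    ("self-citation-loop",
     lambda r: r.get("citing") is not None and r.get("citing") == r.get("cited"),
     "entity cites itself (inconsistency)"),
]

def validate(records: list[dict]) -> list[Issue]:
    """Run every rule over every record, collecting attributed issues."""
    issues: list[Issue] = []
    seen_ids: dict[str, str] = {}
    for rec in records:
        for name, violated, msg in RULES:
            if violated(rec):
                issues.append(Issue(name, rec["id"], msg))
        # Duplication check: two records sharing an external identifier.
        for ext in rec.get("ids", []):
            if ext in seen_ids:
                issues.append(Issue("duplicate-id", rec["id"],
                                    f"shares identifier {ext} with {seen_ids[ext]}"))
            else:
                seen_ids[ext] = rec["id"]
    return issues
```

Because each `Issue` carries the rule name and record identifier, the same output can feed both per-record error reports and the aggregate counts a quality dashboard would display.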
📝 Abstract
Purpose. The increasing emphasis on data quantity in research infrastructures has highlighted the need for equally robust mechanisms ensuring data quality, particularly in bibliographic and citation datasets. This paper addresses the challenge of maintaining high-quality open research information within OpenCitations, a community-guided Open Science Infrastructure, by introducing tools for validating and monitoring bibliographic metadata and citation data.

Methods. We developed a custom validation tool tailored to the OpenCitations Data Model (OCDM), designed to detect and explain ingestion errors from heterogeneous sources, whether caused by upstream data inconsistencies or by internal software bugs. In addition, we created a quality monitoring tool to track known data issues after publication. These tools were applied in two scenarios: (1) validating metadata and citations from Matilda, a potential future source, and (2) monitoring data quality in the existing OpenCitations Meta dataset.

Results. The validation tool identified a variety of structural and semantic issues in the Matilda dataset, demonstrating its precision. The monitoring tool enabled the detection and quantification of recurring problems in the OpenCitations Meta collection. Together, these tools proved effective in enhancing the reliability of OpenCitations' published data.

Conclusion. The presented validation and monitoring tools represent a step toward ensuring high-quality bibliographic data in open research infrastructures, though they are limited to the data model adopted by OpenCitations. Future developments aim to extend support to additional data sources, with particular regard to crowdsourced data.
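The post-publication monitoring described in the Methods and Results (tracking known issue types and quantifying how they recur across releases of a dataset) could be sketched as a small time-series aggregator. This is an assumed design for illustration only; the issue categories and snapshot format are not taken from the paper.

```python
# Hypothetical sketch of a post-publication quality monitor that counts
# known issue types per dataset snapshot and exposes their trend over
# time, in the spirit of the monitoring tool described above.
# Issue categories and snapshot format are illustrative assumptions.
from collections import Counter

KNOWN_ISSUES = ("duplication", "incompleteness", "inconsistency")

def snapshot_report(issues: list[str]) -> Counter:
    """Count occurrences of each known issue type in one snapshot,
    ignoring labels outside the tracked categories."""
    return Counter(i for i in issues if i in KNOWN_ISSUES)

def trend(reports: list[Counter]) -> dict[str, list[int]]:
    """Per-issue time series across successive snapshots, suitable
    for plotting on a quality dashboard."""
    return {k: [r.get(k, 0) for r in reports] for k in KNOWN_ISSUES}
```

Keeping one `Counter` per release makes the monitoring incremental: each new snapshot adds a single report, and the trend of any issue type can be read off without reprocessing earlier releases.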