Integrating Large Citation Datasets

Date: 2025-05-07
AI Summary
Traditional citation metrics such as the h-index and raw citation counts are computed from siloed datasets with heterogeneous formats, which limits how accurately they quantify scientific impact. To address this, the paper merges large citation datasets into a single integrated citation graph. It describes a deduplication and standardization framework for large-scale, cross-repository citation data that targets two main bottlenecks: missing unique identifiers and inconsistent metadata. Using distributed data cleaning, entity alignment, unique identifier generation, and graph-structure normalization, the authors build an integrated dataset of over 119 million scholarly records and 1.4 billion citation relations. They report that the merged citation graph is more accurate than its individual sources and supports a more robust evaluation of scientific impact.

Abstract
This paper explores methods for building a comprehensive citation graph using big data techniques to evaluate scientific impact more accurately. Traditional citation metrics have limitations, and this work investigates merging large citation datasets to create a more accurate picture. Challenges of big data, like inconsistent data formats and lack of unique identifiers, are addressed through deduplication efforts, resulting in a streamlined and reliable merged dataset with over 119 million records and 1.4 billion citations. We demonstrate that merging large citation datasets builds a more accurate citation graph facilitating a more robust evaluation of scientific impact.
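The deduplication step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record fields (`source_id`, `doi`, `title`, `year`), the `W000001`-style merged identifiers, and the matching rule (same DOI, or same normalized title and year when the DOI is missing) are all assumptions made for the example.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def deduplicate(records):
    """Map each source-specific id to one merged id (hypothetical scheme).

    A record joins an existing group if its DOI matches, or, failing that,
    if its normalized title and year match. Otherwise it starts a new group.
    """
    by_doi, by_title, source_to_id = {}, {}, {}
    next_id = 1
    for record in records:
        doi = (record.get("doi") or "").strip().lower()
        title_key = (normalize_title(record["title"]), record.get("year"))
        merged_id = by_doi.get(doi) if doi else None
        if merged_id is None:
            merged_id = by_title.get(title_key)
        if merged_id is None:
            merged_id = "W{:06d}".format(next_id)
            next_id += 1
        if doi:
            by_doi.setdefault(doi, merged_id)
        by_title.setdefault(title_key, merged_id)
        source_to_id[record["source_id"]] = merged_id
    return source_to_id


# Illustrative records from two made-up sources, "A" and "B".
records = [
    {"source_id": "A:1", "doi": "10.1000/XYZ",
     "title": "Graph Théory: An Overview", "year": 2020},
    {"source_id": "B:7", "doi": None,
     "title": "Graph theory - an overview", "year": 2020},
    {"source_id": "B:8", "doi": "10.1000/abc",
     "title": "Another Paper", "year": 2019},
]
# Citation edges as (citing, cited) pairs in source-specific ids.
edges = [("B:8", "A:1"), ("B:8", "B:7")]

mapping = deduplicate(records)
# Rewriting edges to merged ids collapses duplicate citations.
merged_edges = {(mapping[src], mapping[dst]) for src, dst in edges}
```

Here `A:1` and `B:7` resolve to the same merged identifier even though one lacks a DOI, so the two source edges collapse into a single citation. Exact title matching is a deliberately simple choice for the sketch; at the scale reported in the abstract, a real system would need blocking and fuzzier matching.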
Problem

Research questions and friction points this paper is trying to address.

Integrating large citation datasets for comprehensive analysis
Addressing data inconsistency in big citation datasets
Improving scientific impact evaluation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Merge large citation datasets using big data techniques
Address data inconsistencies via deduplication methods
Build accurate citation graph for impact evaluation
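To illustrate why a merged graph can change an impact metric, the sketch below computes an h-index from two partial edge lists and from their union. The edge data and document ids are invented for the example, and it assumes identifiers have already been aligned across sources; it is not taken from the paper's experiments.

```python
from collections import Counter


def citation_counts(edges, documents):
    """Count incoming citations per document, ignoring duplicate edges."""
    counts = Counter(cited for _, cited in set(edges))
    return [counts.get(doc, 0) for doc in documents]


def h_index(counts):
    """Largest h such that h documents have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h


# Hypothetical author with three papers; edges are (citing, cited) pairs.
papers = ["P1", "P2", "P3"]
source_a = [("X1", "P1"), ("X2", "P1"), ("X1", "P2")]
source_b = [("X2", "P1"), ("X3", "P2"), ("X3", "P3"), ("X4", "P3")]

h_a = h_index(citation_counts(source_a, papers))
h_b = h_index(citation_counts(source_b, papers))
h_merged = h_index(citation_counts(source_a + source_b, papers))
```

In this toy case each source alone yields an h-index of 1, while the merged, deduplicated edge set yields 2, because each source sees only part of the citations.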
Inci Yueksel-Erguen
Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany
Ida Litzel
Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany
Hanqiu Peng
National University of Singapore
Deep Learning, Quantum Computing, Financial Mathematics