AI Summary
Traditional citation metrics (e.g., h-index, citation counts) suffer from data silos and heterogeneous formats, which limit their ability to quantify scientific impact accurately. To address this, we construct a high-precision, integrated citation graph by merging large citation datasets. We propose the first deduplication and standardization framework tailored for large-scale, cross-repository citation data, overcoming critical bottlenecks such as missing identifiers and inconsistent metadata. Leveraging distributed data cleaning, entity alignment, unique identifier generation, and graph-structure normalization, we build a high-quality integrated dataset comprising over 119 million scholarly documents and 1.4 billion citation relations. Experimental results demonstrate that our citation graph significantly improves the accuracy and robustness of mainstream impact metrics. The resulting infrastructure is scalable and reproducible, and it provides a foundational resource for rigorous, data-driven scientific evaluation.
Abstract
This paper explores methods for building a comprehensive citation graph using big data techniques to evaluate scientific impact more accurately. Traditional citation metrics have limitations, and this work investigates merging large citation datasets to create a more accurate picture. Challenges of big data, like inconsistent data formats and lack of unique identifiers, are addressed through deduplication efforts, resulting in a streamlined and reliable merged dataset with over 119 million records and 1.4 billion citations. We demonstrate that merging large citation datasets builds a more accurate citation graph facilitating a more robust evaluation of scientific impact.
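As a rough illustration of the deduplication step described above, the following minimal sketch merges two small citation datasets when some records lack a unique identifier. It is not the paper's pipeline: the record fields (`doi`, `title`, `year`), the fallback key built from a normalized title plus year, and the function names are all assumptions made for this example.

```python
import hashlib
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def record_key(record: dict) -> str:
    """Return a stable key: the DOI when present, else a hash of title and year.

    Assumption for this sketch: title plus year is distinctive enough to
    stand in for a missing identifier.
    """
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    basis = f"{normalize_title(record.get('title', ''))}|{record.get('year', '')}"
    return "hash:" + hashlib.sha1(basis.encode("utf-8")).hexdigest()[:16]


def merge_datasets(datasets):
    """Merge (records, edges) pairs into one deduplicated citation graph.

    records maps a source-local id to a metadata dict; edges is an iterable
    of (citing_local_id, cited_local_id) pairs within the same source.
    """
    documents = {}  # unified key -> first metadata seen for that key
    edges = set()   # deduplicated (citing_key, cited_key) pairs
    for records, local_edges in datasets:
        local_to_key = {lid: record_key(rec) for lid, rec in records.items()}
        for lid, rec in records.items():
            documents.setdefault(local_to_key[lid], rec)
        for citing, cited in local_edges:
            # Skip edges that point at records missing from this source.
            if citing in local_to_key and cited in local_to_key:
                edges.add((local_to_key[citing], local_to_key[cited]))
    return documents, edges


source_a = (
    {"a1": {"title": "Graph Mining: A Survey", "year": 2010},
     "a2": {"title": "Deep Nets", "year": 2015}},
    [("a2", "a1")],
)
source_b = (
    {"b1": {"title": "GRAPH MINING - A SURVEY", "year": 2010},
     "b2": {"title": "Deep  Nets!", "year": 2015},
     "b3": {"title": "Citation Analysis at Scale", "year": 2018}},
    [("b2", "b1"), ("b3", "b1")],
)
documents, edges = merge_datasets([source_a, source_b])
```

One limitation of this sketch is that a paper carrying a DOI in one source and none in another receives two different keys. Resolving such cases would need an additional alignment pass, which is omitted here.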