PubMed knowledge graph 2.0: Connecting papers, patents, and clinical trials in biomedical science

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
Biomedical literature, including scholarly articles, patents, and clinical trials, is stored in separate databases with differing standards and formats, which impedes fine-grained cross-source integration. To address this, we construct PKG2.0, a unified knowledge graph covering over 36 million papers, 1.3 million patents, and 0.48 million clinical trials. It links the three document types through biomedical entities, author networks, citation relationships, and research projects, and it adds metadata on NIH-funded projects and their scholarly outputs from the NIH Exporter. Construction relies on fine-grained biomedical entity extraction, high-performance author name disambiguation, and multi-source citation integration. Data validation indicates strong performance on author disambiguation and biomedical entity recognition. The resulting dataset supports literature mining, bibliometric analysis, and biomedical research that needs to trace evidence across papers, patents, and trials.

📝 Abstract
Papers, patents, and clinical trials are indispensable types of scientific literature in biomedicine, crucial for knowledge sharing and dissemination. However, these documents are often stored in disparate databases with varying management standards and data formats, making it challenging to form systematic, fine-grained connections among them. To address this issue, we introduce PKG2.0, a comprehensive knowledge graph dataset encompassing over 36 million papers, 1.3 million patents, and 0.48 million clinical trials in the biomedical field. PKG2.0 integrates these previously dispersed resources through various links, including biomedical entities, author networks, citation relationships, and research projects. Fine-grained biomedical entity extraction, high-performance author name disambiguation, and multi-source citation integration have played a crucial role in the construction of the PKG dataset. Additionally, project data from the NIH Exporter enriches the dataset with metadata of NIH-funded projects and their scholarly outputs. Data validation demonstrates that PKG2.0 excels in key tasks such as author disambiguation and biomedical entity recognition. This dataset provides valuable resources for biomedical researchers, bibliometric scholars, and those engaged in literature mining.
Problem

Research questions and friction points this paper is trying to address.

Connecting disparate biomedical papers, patents, and clinical trials
Integrating fragmented resources with varying standards and formats
Enabling systematic knowledge linkages in biomedical research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates papers, patents, and clinical trials
Links biomedical entities, citations, and projects
Uses entity extraction and author disambiguation
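
The entity-level linking described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Document` class, the identifier formats, and the `link_by_shared_entities` helper are hypothetical and do not reflect the actual PKG2.0 schema or pipeline. It only shows the general idea that documents from different sources become connected once their entity mentions are normalized to shared IDs.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Document:
    """Hypothetical record; not the PKG2.0 schema."""
    doc_id: str          # illustrative identifier, e.g. "paper:1"
    doc_type: str        # "paper", "patent", or "trial"
    entities: frozenset  # normalized entity IDs mentioned in the document


def link_by_shared_entities(documents):
    """Return {(doc_id_a, doc_id_b): shared entity IDs} for cross-type pairs."""
    # Inverted index: entity ID -> documents that mention it.
    index = defaultdict(list)
    for doc in documents:
        for entity in doc.entities:
            index[entity].append(doc)

    links = defaultdict(set)
    for entity, docs in index.items():
        for a, b in combinations(docs, 2):
            if a.doc_type != b.doc_type:  # keep only cross-source links
                key = tuple(sorted((a.doc_id, b.doc_id)))
                links[key].add(entity)
    return dict(links)


# Toy data, invented for this example.
docs = [
    Document("paper:1", "paper", frozenset({"gene:BRCA1", "disease:breast_cancer"})),
    Document("patent:7", "patent", frozenset({"gene:BRCA1"})),
    Document("trial:3", "trial", frozenset({"disease:breast_cancer", "drug:olaparib"})),
]
links = link_by_shared_entities(docs)
for pair, shared in sorted(links.items()):
    print(pair, sorted(shared))
```

In this toy data the paper is linked to the patent through a shared gene and to the trial through a shared disease, while the patent and the trial share no entity and stay unlinked. Citation, author, and project links would add further edge types on top of the same document nodes.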
Jian Xu
School of Information Management, Sun Yat-sen University, Guangzhou, China
Chao Yu
School of Information Management, Sun Yat-sen University, Guangzhou, China
Jiawei Xu
School of Information, University of Texas at Austin, Austin, TX, USA
Ying Ding
Bill & Lewis Suit Professor, School of Information, Dell Med, University of Texas at Austin
AI in Health, Knowledge Graph, Science of Science
Vetle I. Torvik
School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA
Jaewoo Kang
Department of Computer Science and Engineering, Korea University, Seoul, South Korea
Mujeen Sung
Assistant Professor at Kyung Hee University
Natural Language Processing
Minju Song
Department of Library and Information Science, Yonsei University, Seoul, South Korea