VitaGraph: Building a Knowledge Graph for Biologically Relevant Learning Tasks

📅 2025-05-16
🤖 AI Summary
To address low-quality knowledge graphs, poor interpretability, and weak generalization in biomedical prediction tasks, this work introduces VitaGraph, a multi-source integrated biomedical knowledge graph. We systematically curate the Drug Repurposing Knowledge Graph (DRKG), removing inconsistencies and redundancies, and integrate interpretable biological features, including molecular fingerprints and Gene Ontology (GO) annotations, as node feature vectors. We benchmark the resulting graph on drug repurposing, protein-protein interaction (PPI) prediction, and polypharmacy side-effect prediction, each modeled as a link prediction problem. VitaGraph is offered as a coherent, benchmark-ready knowledge graph platform for computational biology and precision medicine.

📝 Abstract
The intrinsic complexity of human biology presents ongoing challenges to scientific understanding. Researchers collaborate across disciplines to expand our knowledge of the biological interactions that define human life. AI methodologies have emerged as powerful tools across scientific domains, particularly in computational biology, where graph data structures effectively model biological systems such as protein-protein interaction (PPI) networks and gene functional networks. These networks serve as datasets for key network medicine tasks, such as gene-disease association prediction, drug repurposing, and polypharmacy side-effect studies. Reliable predictions from machine learning models require high-quality foundational data. In this work, we present a comprehensive multi-purpose biological knowledge graph constructed by integrating and refining multiple publicly available datasets. Building upon the Drug Repurposing Knowledge Graph (DRKG), we define a pipeline that (a) cleans the inconsistencies and redundancies present in DRKG, (b) coalesces information from the main publicly available data sources, and (c) enriches the graph nodes with expressive feature vectors such as molecular fingerprints and gene ontologies. Biologically and chemically relevant features improve the capacity of machine learning models to generate accurate and well-structured embedding spaces. The resulting resource is a coherent and reliable biological knowledge graph that serves as a state-of-the-art platform to advance research in computational biology and precision medicine. Moreover, it offers the opportunity to benchmark graph-based machine learning and network medicine models on relevant tasks. We demonstrate the effectiveness of the proposed dataset by benchmarking it on drug repurposing, PPI prediction, and side-effect prediction, each modeled as a link prediction problem.
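
The cleaning step of the pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes triples of the form (head, relation, tail) with entities written as `Type::identifier`; the relation names and the `SYMMETRIC_RELATIONS` set are hypothetical and chosen only for illustration. The sketch normalizes entity strings, stores undirected relations in one canonical direction, and drops exact duplicates.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical set of relations treated as undirected (not taken from the paper).
SYMMETRIC_RELATIONS: Set[str] = {"interacts_with"}


def normalize_entity(entity: str) -> str:
    """Trim whitespace and canonicalize the type prefix of 'Type::identifier'."""
    entity = entity.strip()
    node_type, separator, identifier = entity.partition("::")
    if not separator:
        return entity  # no type prefix present: leave the string as-is
    return f"{node_type.strip().title()}::{identifier.strip()}"


def clean_triples(triples: Iterable[Triple]) -> List[Triple]:
    """Normalize entities, canonicalize symmetric edges, and remove duplicates."""
    seen: Set[Triple] = set()
    cleaned: List[Triple] = []
    for head, relation, tail in triples:
        head, tail = normalize_entity(head), normalize_entity(tail)
        relation = relation.strip()
        # Store each undirected edge in one direction so (a, b) and (b, a) collapse.
        if relation in SYMMETRIC_RELATIONS and tail < head:
            head, tail = tail, head
        key = (head, relation, tail)
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


# Toy input with a reversed duplicate and an exact duplicate (illustrative data only).
raw = [
    ("Gene::1", "interacts_with", "Gene::2"),
    ("gene::2 ", "interacts_with", "Gene::1"),
    ("Compound::DB01", "treats", "Disease::D1"),
    ("Compound::DB01", "treats", "Disease::D1"),
]
cleaned = clean_triples(raw)
```

In this toy input, the four raw triples reduce to two: the reversed PPI edge collapses into its canonical form, and the repeated compound-disease edge is kept once. A real pipeline would also need identifier mapping across sources, which this sketch does not attempt.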
Problem

Research questions and friction points this paper is trying to address.

Constructing a high-quality biological knowledge graph for computational biology
Integrating and refining diverse public datasets to improve data reliability
Enhancing machine learning models for drug repurposing and disease prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrating and refining multiple public biological datasets
Enriching graph nodes with expressive feature vectors
Benchmarking graph-based models on key biomedical tasks
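
As a rough illustration of the last two points, the sketch below shows one way node features and a link prediction ranking could be expressed. It is a sketch under stated assumptions, not the paper's method: the multi-hot GO encoding, the set-based Tanimoto similarity, and the dot-product scorer are generic stand-ins, and all node names, vectors, and the three-term vocabulary are toy values invented for the example.

```python
from typing import Dict, List, Sequence, Set


def go_multi_hot(annotations: Set[str], vocabulary: Sequence[str]) -> List[int]:
    """Encode a gene's GO terms as a 0/1 vector over a fixed vocabulary."""
    return [1 if term in annotations else 0 for term in vocabulary]


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product, used here as an assumed link scoring function."""
    return sum(a * b for a, b in zip(u, v))


def rank_of_true_tail(
    head_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    true_tail: str,
) -> int:
    """Rank (1 = best) of the true tail among all candidate tails."""
    scores = {name: dot(head_vec, vec) for name, vec in candidates.items()}
    true_score = scores[true_tail]
    # Count candidates that score strictly higher than the true tail.
    better = sum(
        1 for name, score in scores.items()
        if name != true_tail and score > true_score
    )
    return 1 + better


# Toy GO vocabulary (the three GO root terms) and a toy annotation set.
vocabulary = ["GO:0008150", "GO:0003674", "GO:0005575"]
gene_features = go_multi_hot({"GO:0003674"}, vocabulary)

# Toy fingerprints given as sets of on-bit indices.
similarity = tanimoto({1, 2, 3}, {2, 3, 4})

# Toy embeddings: rank the true tail "c" for a head among three candidates.
candidate_vectors = {"a": [0.9, 0.0], "b": [0.2, 1.0], "c": [0.5, 0.0]}
rank = rank_of_true_tail([1.0, 0.0], candidate_vectors, "c")
reciprocal_rank = 1.0 / rank
```

Averaging `reciprocal_rank` over many held-out edges gives the mean reciprocal rank, a common way to report link prediction quality. Which metrics the paper itself reports is not stated in this summary.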