Robustness of Graph Self-Supervised Learning to Real-World Noise: A Case Study on Text-Driven Biomedical Graphs

📅 2026-05-06
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the vulnerability of graph self-supervised learning (GSSL) to the noise inherent in real-world, text-derived biomedical knowledge graphs, a challenge overlooked by prior work, which predominantly evaluates methods on clean or synthetically perturbed graphs. To bridge this gap, the authors present the first comprehensive evaluation of GSSL methods in realistic noisy settings, comparing model performance on a noisy graph derived from MedMentions against a clean, curated UMLS reference graph. Across pretext tasks and GNN architectures, they find that feature reconstruction remains robust under noise, that relation reconstruction is highly sensitive to it, and that bidirectional relational message passing is better suited to noisy graphs while unidirectional variants perform best on clean graphs. The work also proposes NATD-GSSL, a unified pipeline integrating graph construction, refinement, and GSSL, which achieves up to a 7% improvement over pretrained language model baselines. Code and benchmarks are publicly released.
📝 Abstract
Graph Self-Supervised Learning (GSSL) offers a powerful paradigm for learning graph representations without labeled data. However, existing work assumes clean, manually curated graphs. Recent advances in NLP enable the large-scale automatic extraction of knowledge graphs from text, opening new opportunities for GSSL while introducing substantial real-world noise. This type of noise remains largely unexplored, as prior robustness studies typically rely on synthetic perturbations. To address this gap, we present the first comprehensive evaluation of GSSL methods on text-driven graphs for unsupervised term typing. We introduce Noise-Aware Text-Driven Graph GSSL (NATD-GSSL), a unified framework that combines automatic graph construction, graph refinement, and GSSL. Our evaluation follows a dual-graph protocol that contrasts a noisy graph derived from MedMentions with a clean Unified Medical Language System (UMLS) reference graph, aligned through a shared gold standard. Our results reveal variability in robustness across both pretext tasks and Graph Neural Network (GNN) architectures. Relation reconstruction is highly sensitive to noise and benefits from well-defined schemas, whereas feature reconstruction is considerably more robust, achieving performance comparable to clean-graph settings. Contrastive objectives are generally less affected by noise but depend strongly on alignment with downstream tasks. GNN architecture also plays a critical role: bidirectional relational message-passing designs are better suited to noisy, text-driven graphs, while unidirectional relational ones perform best on clean graphs. Overall, NATD-GSSL provides practical guidance for applying GSSL to real-world, noisy graphs and achieves up to a 7% improvement over pretrained language model baselines. All code and benchmarks are publicly available at https://github.com/OthmaneKabal/MC2GAE.
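To make the unidirectional versus bidirectional distinction concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a toy graph of (head, relation, tail) triples, models bidirectionality by adding one reversed edge per triple, and uses unweighted mean aggregation in place of the relation-specific weights a real relational GNN would learn. The function names, the `_inv` suffix, and the toy data are all hypothetical.

```python
from collections import defaultdict


def add_inverse_edges(triples):
    """Return the triples plus one reversed copy per edge (relation tagged '_inv')."""
    inverse = [(t, r + "_inv", h) for (h, r, t) in triples]
    return list(triples) + inverse


def mean_aggregate(features, triples):
    """One round of message passing: each tail averages the features of its heads.

    Simplification (assumption): relation types are ignored, whereas a
    relational GNN would apply a separate transformation per relation.
    """
    incoming = defaultdict(list)
    for head, _rel, tail in triples:
        incoming[tail].append(features[head])

    updated = {}
    for node, feat in features.items():
        msgs = incoming.get(node)
        if not msgs:
            # No incoming edges: the node receives nothing and keeps its feature.
            updated[node] = list(feat)
        else:
            updated[node] = [
                sum(m[i] for m in msgs) / len(msgs) for i in range(len(feat))
            ]
    return updated


# Hypothetical toy graph: "aspirin" only has outgoing edges.
triples = [("aspirin", "treats", "headache"), ("aspirin", "is_a", "drug")]
features = {"aspirin": [1.0, 0.0], "headache": [0.0, 1.0], "drug": [0.0, 0.0]}

uni = mean_aggregate(features, triples)                    # unidirectional
bi = mean_aggregate(features, add_inverse_edges(triples))  # bidirectional
```

In the unidirectional pass, `aspirin` receives no messages because it is never a tail, so its representation depends entirely on its own feature. With inverse edges it also aggregates from `headache` and `drug`. This gives one plausible intuition, not a claim made in the abstract, for why extra context could help when individual extracted edges are unreliable.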
Problem

Research questions and friction points this paper is trying to address.

Graph Self-Supervised Learning
Real-World Noise
Text-Driven Graphs
Robustness
Knowledge Graph
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph Self-Supervised Learning
Real-World Noise
Text-Driven Knowledge Graphs
Robustness Evaluation
Noise-Aware GSSL