On Improving Graph Neural Networks for QSAR by Pre-training on Extended-Connectivity Fingerprints

📅 2026-05-11

🤖 AI Summary
This work addresses the disputed advantage of molecular graph neural networks (GNNs) over classical molecular featurisation in quantitative structure–activity relationship (QSAR) tasks. The authors propose pre-training GNNs to predict extended-connectivity fingerprints (ECFP), so that the network learns ECFP substructure information before fine-tuning, and evaluate the approach on both in-distribution and out-of-distribution (OOD) QSAR benchmarks. On five of the six Biogen datasets, ECFP pre-trained GNNs achieve a statistically significant improvement in standard performance metrics over all evaluated baselines. For more heterogeneous datasets and more complex endpoints, such as binding affinity prediction, the pre-trained GNNs underperform in OOD settings. The study also investigates how substructure-level data leakage during pre-training affects downstream performance.
📝 Abstract
Molecular Graph Neural Networks (GNNs) are increasingly common in drug discovery, particularly for Quantitative Structure-Activity Relationship (QSAR) studies; yet, their superiority compared to classical molecular featurisation approaches is disputed. We report a general strategy for improving GNNs for QSAR by pre-training to predict Extended-Connectivity Fingerprints (ECFP). We validate our approach with statistical tests and challenging out-of-distribution (OOD) splits. Across five out of six Biogen benchmarks, we observed a statistically significant improvement in standard performance metrics over all evaluated baselines when using ECFP pre-trained GNNs. However, for more heterogeneous datasets and more complex endpoints, such as binding affinity prediction, pre-trained GNNs underperformed in OOD settings. Importantly, we investigated the impact of substructure-level data leakage during pre-training on downstream performance. While we identified scenarios where pre-training on ECFPs was less effective, our findings show that ECFP-based pre-training can enhance downstream OOD performance on a diverse set of practically relevant QSAR tasks.
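To make the pre-training target concrete, the following is a minimal sketch of an ECFP-style fingerprint on a toy molecular graph. It is not the authors' implementation and not the full ECFP algorithm: real ECFP uses richer atom invariants, bond orders, and duplicate-substructure removal, and in practice would likely come from a cheminformatics toolkit such as RDKit. The names `toy_ecfp` and `_hash_to_int`, the plain element-symbol atom labels, and the defaults `radius=2` and `n_bits=64` are all assumptions made for illustration. The resulting bit vector is the kind of multi-label target a GNN could be pre-trained to predict.

```python
import hashlib


def _hash_to_int(text: str) -> int:
    """Deterministic 64-bit hash (hypothetical helper; Python's hash() is salted)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def toy_ecfp(atom_labels, bonds, radius=2, n_bits=64):
    """Simplified ECFP-like fingerprint for a molecule given as a labelled graph.

    atom_labels: list of element symbols, one per atom (assumed atom invariant).
    bonds: list of (i, j) index pairs; bond orders are ignored in this sketch.
    Returns a list of n_bits zeros and ones.
    """
    n_atoms = len(atom_labels)
    neighbours = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Radius 0: each atom identifier depends only on the atom's own label.
    identifiers = [_hash_to_int(label) for label in atom_labels]
    seen = set(identifiers)

    # Each iteration grows every atom's environment by one bond.
    for _ in range(radius):
        updated = []
        for atom in range(n_atoms):
            # Sorting makes the result independent of atom and bond ordering.
            env = sorted(identifiers[nb] for nb in neighbours[atom])
            updated.append(_hash_to_int(f"{identifiers[atom]}|{env}"))
        identifiers = updated
        seen.update(identifiers)

    # Fold all collected substructure identifiers into a fixed-length bit vector.
    bits = [0] * n_bits
    for ident in seen:
        bits[ident % n_bits] = 1
    return bits


# Ethanol heavy atoms (C-C-O) as a toy example.
ethanol_bits = toy_ecfp(["C", "C", "O"], [(0, 1), (1, 2)])
print(sum(ethanol_bits), "bits set out of", len(ethanol_bits))
```

Because neighbour identifiers are sorted before hashing, the fingerprint does not depend on how the atoms are numbered, which is the property that lets the same target be attached to any graph encoding of the molecule.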
Problem

Research questions and friction points this paper is trying to address.

Graph Neural Networks
QSAR
Out-of-Distribution
Molecular Representation
Pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph Neural Networks
Pre-training
Extended-Connectivity Fingerprints
QSAR
Out-of-Distribution Generalization

Sam Money-Kyrle
Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford, OX1 3LB, United Kingdom.
Markus Dablander
Mathematical Institute, University of Oxford, Andrew Wiles Building, Woodstock Road, Oxford, OX2 6GG, United Kingdom.
Thierry Hanser
Molecular Informatics and AI, Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds, LS11 5PS, United Kingdom.
Stephane Werner
Molecular Informatics and AI, Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds, LS11 5PS, United Kingdom.
Charlotte M. Deane
Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford, OX1 3LB, United Kingdom.
Garrett M. Morris
University of Oxford
Computational Chemistry, Computer-Aided Drug Design, Virtual Screening, Docking, Machine Learning & AI