🤖 AI Summary
This work addresses the unstable out-of-distribution (OOD) generalization of molecular graph neural networks (GNNs) in quantitative structure–activity relationship (QSAR) tasks. The authors propose a pretraining strategy in which the GNN learns to predict extended-connectivity fingerprints (ECFP), which encode substructure information. The approach is evaluated with statistical tests on both in-distribution and OOD QSAR benchmarks. On five of six Biogen datasets, ECFP-pretrained GNNs show a statistically significant improvement over all evaluated baselines; on more heterogeneous datasets and more complex endpoints such as binding affinity prediction, however, they underperform in OOD settings. The study also examines how substructure-level data leakage during pretraining affects downstream performance.
📝 Abstract
Molecular Graph Neural Networks (GNNs) are increasingly common in drug discovery, particularly for Quantitative Structure-Activity Relationship (QSAR) studies, yet their superiority over classical molecular featurisation approaches is disputed. We report a general strategy for improving GNNs for QSAR by pre-training them to predict Extended-Connectivity Fingerprints (ECFP). We validate our approach with statistical tests and challenging out-of-distribution (OOD) splits. On five of six Biogen benchmarks, ECFP pre-trained GNNs showed a statistically significant improvement in standard performance metrics over all evaluated baselines. However, for more heterogeneous datasets and more complex endpoints, such as binding affinity prediction, pre-trained GNNs underperformed in OOD settings. We also investigated the impact of substructure-level data leakage during pre-training on downstream performance. While we identified scenarios where pre-training on ECFPs was less effective, our findings show that ECFP-based pre-training can enhance downstream OOD performance on a diverse set of practically relevant QSAR tasks.
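As a rough illustration of what "pre-training to predict ECFP" could look like, the sketch below treats a fingerprint as a fixed-length bit vector and scores a model's per-bit logits with binary cross-entropy. The abstract does not state the loss function, the fingerprint length, or the fingerprint type (bits versus counts), so all of these are assumptions, and `bce_with_logits` is a hypothetical helper rather than the authors' code. In practice the target bits would come from a cheminformatics toolkit and the logits from a GNN readout head.

```python
import math


def bce_with_logits(logits, target_bits):
    """Mean binary cross-entropy between per-bit logits and 0/1 fingerprint bits.

    Assumption: the pre-training target is a binary ECFP bit vector and each
    bit is predicted independently (multi-label classification).
    """
    if len(logits) != len(target_bits):
        raise ValueError("logits and target_bits must have the same length")
    total = 0.0
    for z, y in zip(logits, target_bits):
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical 8-bit fingerprint; real ECFPs are typically much longer.
target = [1, 0, 0, 1, 0, 0, 0, 1]

# Logits that agree with the target bits versus logits that contradict them.
good_logits = [4.0 if bit else -4.0 for bit in target]
bad_logits = [-z for z in good_logits]

loss_good = bce_with_logits(good_logits, target)
loss_bad = bce_with_logits(bad_logits, target)
print(f"loss (agreeing logits):     {loss_good:.4f}")
print(f"loss (contradicting logits): {loss_bad:.4f}")
```

Minimising this kind of loss over a large unlabeled molecule set would push the network's embedding to encode which substructures are present, which is the intuition the abstract appeals to. A count-based fingerprint would call for a regression-style loss instead.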