Benchmarking GNN Models on Molecular Regression Tasks with CKA-Based Representation Analysis

📅 2026-02-24
🤖 AI Summary
This work systematically benchmarks graph neural networks (GNNs) on small-molecule regression tasks and examines the inductive biases of four architectures: GCN, GraphSAGE, GIN, and GAT. The authors implement a hierarchical fusion strategy that combines GNNs with molecular fingerprints (GNN+FP); it matches or outperforms standalone GNNs, with an RMSE improvement of more than 7%. Centered kernel alignment (CKA) is used to compare representation spaces, showing that GNN and fingerprint embeddings occupy highly independent latent spaces (CKA ≤ 0.46), whereas GCN, GraphSAGE, and GIN converge strongly (CKA ≥ 0.88) and GAT learns moderately independent representations (CKA 0.55 to 0.80). These findings suggest that most GNN variants learn largely similar representations and that fingerprints carry complementary information.

📝 Abstract
Molecules are commonly represented as SMILES strings, which can be readily converted to fixed-size molecular fingerprints. These fingerprints serve as feature vectors for training ML/DL models on molecular property prediction tasks in fields such as computational chemistry, drug discovery, biochemistry, and materials science. Recent research has demonstrated that SMILES can also be used to construct molecular graphs in which atoms are nodes ($V$) and bonds are edges ($E$). These graphs can then be used to train geometric DL models such as GNNs, which learn the inherent structural relationships within a molecule rather than depending on fixed-size fingerprints. Although GNNs are powerful aggregators, their efficacy on smaller datasets and the inductive biases of different architectures are less studied. In the present study, we systematically benchmarked four GNN architectures on datasets from diverse domains (physical chemistry, biological, and analytical). We also implemented a hierarchical fusion (GNN+FP) framework for target prediction. The fusion framework consistently outperforms or matches standalone GNNs (RMSE improvement $>7\%$) and baseline models. Further, we investigated representational similarity between GNN and fingerprint embeddings using centered kernel alignment (CKA) and found that they occupy highly independent latent spaces (CKA $\le 0.46$). The cross-architecture CKA scores suggest high convergence among isotropic models such as GCN, GraphSAGE, and GIN (CKA $\geq 0.88$), with GAT learning moderately independent representations (CKA $0.55$ to $0.80$).
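To make the CKA scores above concrete, here is a minimal sketch of how such a score can be computed for two embedding matrices with one row per molecule. It assumes the linear-kernel form of CKA; the abstract does not say which kernel the authors used, and the function name `linear_cka`, the variable names, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two embedding matrices (rows = the same molecules).

    Assumption: linear kernel. The paper may use a different CKA variant.
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have one row per molecule, in the same order")
    # Center every feature column so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Random stand-ins for real embeddings (hypothetical shapes, not the paper's data).
rng = np.random.default_rng(0)
gnn_emb = rng.normal(size=(200, 64))   # e.g. pooled GNN graph embeddings
fp_emb = rng.normal(size=(200, 128))   # e.g. fingerprint-branch embeddings

# A rotated copy of the same embedding should score close to 1.
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
same_space = linear_cka(gnn_emb, gnn_emb @ rotation)
unrelated = linear_cka(gnn_emb, fp_emb)
print(same_space, unrelated)
```

Because the score is unchanged by rotations and uniform rescaling of either embedding, it can compare representations of different widths, such as a GNN embedding and a fingerprint embedding, as long as both are computed on the same set of molecules.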
Problem

Research questions and friction points this paper is trying to address.

Graph Neural Networks
Molecular Regression
Representation Analysis
SMILES
Fingerprint Embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph Neural Networks
Molecular Representation
Fingerprint Fusion
Centered Kernel Alignment
Representation Similarity
Rajan
School of Interdisciplinary Research, Indian Institute of Technology Delhi, New Delhi 110016, India
Ishaan Gupta
Assistant Professor, Indian Institute of Technology Delhi, India
Functional Genomics