🤖 AI Summary
Molecular representation learning is hindered by scarce labeled data and by self-supervised methods that underuse 3D geometric information: most rely solely on 2D topology or on hand-crafted augmentations. To address this, we propose C-FREE (Contrast-Free Representation learning on Ego-nets), a contrast-free framework that jointly models 2D graph structure and ensembles of 3D conformers. It uses fixed-radius ego-nets as modeling units and subgraph embedding prediction as the pretext task, so it requires no negative samples, positional encodings, or complex augmentations. Built on a hybrid GNN-Transformer backbone, C-FREE combines geometric and topological information in a single representation. Pretrained on GEOM, C-FREE achieves state-of-the-art results on MoleculeNet property prediction benchmarks, surpassing contrastive, generative, and other multimodal self-supervised approaches. Fine-tuning on datasets of varying sizes and molecule types further shows that the pretrained representations transfer to new chemical domains.
📝 Abstract
High-quality molecular representations are essential for property prediction and molecular design, yet large labeled datasets remain scarce. While self-supervised pretraining on molecular graphs has shown promise, many existing approaches depend on either hand-crafted augmentations or complex generative objectives, and often rely solely on 2D topology, leaving valuable 3D structural information underutilized. To address this gap, we introduce C-FREE (Contrast-Free Representation learning on Ego-nets), a simple framework that integrates 2D graphs with ensembles of 3D conformers. C-FREE learns molecular representations by predicting subgraph embeddings from their complementary neighborhoods in the latent space, using fixed-radius ego-nets as modeling units across different conformers. This design allows us to integrate both geometric and topological information within a hybrid Graph Neural Network (GNN)-Transformer backbone, without negative samples, positional encodings, or expensive pre-processing. Pretrained on the GEOM dataset, which provides rich 3D conformational diversity, C-FREE achieves state-of-the-art results on MoleculeNet, surpassing contrastive, generative, and other multimodal self-supervised methods. Fine-tuning across datasets with diverse sizes and molecule types further demonstrates that pretraining transfers effectively to new chemical domains, highlighting the importance of 3D-informed molecular representations.
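The split between a fixed-radius ego-net and its complementary neighborhood can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of hop count on the 2D graph as the radius, the reading of "complementary neighborhood" as all atoms outside the ego-net, and the toy five-atom chain are all assumptions made for illustration. The encoder, the predictor, and the handling of 3D conformers are omitted.

```python
from collections import deque


def ego_net(adjacency, center, radius):
    """Return the set of atoms within `radius` hops of `center` (breadth-first search)."""
    hops = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if hops[node] == radius:
            continue  # do not expand beyond the fixed radius
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)


def split_target_context(adjacency, center, radius):
    """Split atoms into a target ego-net and its complement (assumed context)."""
    target = ego_net(adjacency, center, radius)
    context = set(adjacency) - target
    return target, context


# Hypothetical toy molecule: a chain of five heavy atoms, 0-1-2-3-4.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

target, context = split_target_context(chain, center=0, radius=1)
print(sorted(target), sorted(context))
```

In a pretraining loop along these lines, an encoder would embed the context atoms and a predictor would be trained to match the embedding of the target ego-net in latent space, which is why no negative samples are needed.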