🤖 AI Summary
To address the practical bottleneck of scarce experimental labels (≤50 samples per task) in molecular property prediction, this paper proposes MoleVers, a pretrained model built on a two-stage strategy that combines self-supervised and weakly supervised learning. In the first stage, the model undergoes large-scale unsupervised pretraining on unlabeled molecules via masked atom prediction and dynamic denoising, a novel task enabled by a new branching encoder architecture. In the second stage, low-cost computational auxiliary labels (e.g., DFT-derived property values) serve as weak supervision signals, enabling supervised pretraining without costly experimental data. Contributions include: (1) a dynamic denoising pretraining task coupled with a computational-label-guided weakly supervised pretraining paradigm; and (2) MolBench, a new benchmark covering 22 real-world few-shot molecular property datasets. On MolBench, MoleVers achieves state-of-the-art performance on 20 datasets and ranks second on the remaining two, outperforming existing few-shot molecular property prediction approaches.
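The stage-1 pretext tasks described above can be illustrated with a toy corruption routine: mask a fraction of atom types (targets for masked atom prediction) and perturb 3D coordinates with Gaussian noise (targets for denoising). This is a minimal sketch, not the paper's implementation; the function name, the `MASK_TOKEN` sentinel, and the default fractions are all illustrative assumptions, and the fixed `noise_scale` stands in for whatever dynamically chosen noise level the paper's dynamic denoising task uses.

```python
import numpy as np

MASK_TOKEN = -1  # hypothetical sentinel marking a masked atom type

def corrupt_molecule(atom_types, coords, mask_frac=0.15, noise_scale=0.1, rng=None):
    """Toy stage-1 corruption for self-supervised pretraining.

    Returns the corrupted atom types, noisy coordinates, the masked indices,
    and the added noise (the denoising regression target).
    """
    rng = np.random.default_rng(rng)
    n = len(atom_types)

    # Masked atom prediction: hide a random subset of atom types.
    corrupted_types = atom_types.copy()
    mask_idx = rng.choice(n, size=max(1, int(mask_frac * n)), replace=False)
    corrupted_types[mask_idx] = MASK_TOKEN

    # Denoising: perturb coordinates; the model learns to predict the noise.
    noise = rng.normal(0.0, noise_scale, size=coords.shape)
    noisy_coords = coords + noise

    return corrupted_types, noisy_coords, mask_idx, noise
```

A model pretrained on such corrupted inputs would predict the original atom types at `mask_idx` and regress the added `noise`, before stage-2 pretraining on computational auxiliary labels.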
📝 Abstract
Accurate property prediction is crucial for accelerating the discovery of new molecules. Although deep learning models have achieved remarkable success, their performance often relies on large amounts of labeled data that are expensive and time-consuming to obtain. Thus, there is a growing need for models that can perform well with limited experimentally validated data. In this work, we introduce MoleVers, a versatile pretrained model designed for various types of molecular property prediction in the wild, i.e., where experimentally validated molecular property labels are scarce. MoleVers adopts a two-stage pretraining strategy. In the first stage, the model learns molecular representations from large unlabeled datasets via masked atom prediction and dynamic denoising, a novel task enabled by a new branching encoder architecture. In the second stage, MoleVers is further pretrained using auxiliary labels obtained with inexpensive computational methods, enabling supervised learning without the need for costly experimental data. This two-stage framework allows MoleVers to learn representations that generalize effectively across various downstream datasets. We evaluate MoleVers on a new benchmark comprising 22 molecular datasets with diverse types of properties, the majority of which contain 50 or fewer training labels, reflecting real-world conditions. MoleVers achieves state-of-the-art results on 20 of the 22 datasets and ranks second on the remaining two, highlighting its ability to bridge the gap between data-hungry models and real-world conditions where practically useful labels are scarce.