A Semi-supervised Molecular Learning Framework for Activity Cliff Estimation

📅 2024-08-01

🏛️ International Joint Conference on Artificial Intelligence

📈 Citations: 5

✨ Influential: 0

🤖 AI Summary

This work addresses activity cliffs in molecular property prediction, where existing models degrade sharply when labeled data are limited. To this end, we propose SemiMol, the first semi-supervised learning framework tailored to molecular regression tasks, which combines a reliability-aware pseudo-labeling mechanism with adaptive curriculum learning. Using a teacher–student architecture, SemiMol generates pseudo-labels for unlabeled molecules, evaluates their credibility, and dynamically adjusts training difficulty to improve graph neural network learning. This sidesteps a limitation of conventional pseudo-labeling methods, which rely on classification probability outputs and are therefore ill-suited to regression. Extensive experiments on 30 activity cliff datasets show that SemiMol substantially outperforms state-of-the-art pretraining and semi-supervised baselines.

📝 Abstract

Machine learning (ML) enables accurate and fast molecular property prediction, which is of interest in drug discovery and material design. The success of ML models rests on the similarity principle, which assumes that similar molecules exhibit similar properties. However, activity cliffs challenge this principle, and their presence leads to a sharp decline in the performance of existing ML algorithms, particularly graph-based methods. To overcome this obstacle in a low-data scenario, we propose a novel semi-supervised learning (SSL) method dubbed SemiMol, which uses predictions on large amounts of unannotated data as pseudo-signals for subsequent training. Specifically, we introduce an additional instructor model to evaluate the accuracy and trustworthiness of proxy labels, because existing pseudo-labeling approaches require probabilistic outputs to reveal the model's confidence and therefore cannot be applied to regression tasks. Moreover, we design a self-adaptive curriculum learning algorithm that progressively moves the target model toward hard samples at a controllable pace. Extensive experiments on 30 activity cliff datasets demonstrate that SemiMol significantly enhances graph-based ML architectures and outperforms state-of-the-art pretraining and SSL baselines.
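The pseudo-labeling idea in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `select_pseudo_labels` and `curriculum_threshold`, the reliability scores, and the linear threshold schedule are all assumptions made for the example. The abstract only says that an instructor model judges the trustworthiness of proxy labels and that a self-adaptive curriculum moves toward hard samples; a fixed linear schedule is swapped in here for simplicity.

```python
import numpy as np

def select_pseudo_labels(student_pred, reliability, threshold):
    """Keep unlabeled samples whose reliability score passes the threshold.

    `reliability` stands in for the instructor model's output (hypothetical:
    a score in [0, 1] per unlabeled molecule).
    """
    mask = reliability >= threshold
    return mask, student_pred[mask]

def curriculum_threshold(epoch, total_epochs, start=0.9, end=0.5):
    """Linearly relax the threshold so less reliable (harder) samples enter later.

    Assumption: a linear schedule replaces the paper's self-adaptive pace.
    """
    frac = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return start + (end - start) * frac

# Toy data: regression predictions for 8 unlabeled molecules and made-up scores.
rng = np.random.default_rng(0)
student_pred = rng.normal(6.0, 1.0, size=8)
reliability = np.array([0.95, 0.40, 0.80, 0.55, 0.99, 0.65, 0.30, 0.72])

for epoch in range(3):
    tau = curriculum_threshold(epoch, 3)
    mask, pseudo_labels = select_pseudo_labels(student_pred, reliability, tau)
    # The selected (molecule, pseudo_label) pairs would join the labeled set here.
    print(f"epoch={epoch} threshold={tau:.2f} selected={int(mask.sum())}")
```

In this toy run the number of selected molecules grows as the threshold relaxes, which is the qualitative behavior a curriculum over pseudo-labels aims for.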

Problem

Research questions and friction points this paper is trying to address.

activity cliff

molecular property prediction

similarity principle

low-data scenario

graph-based methods
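To make the "activity cliff" and "similarity principle" keywords concrete, here is a small sketch of how a cliff pair might be flagged. It is only an illustration under stated assumptions: fingerprints are represented as plain sets of bit indices, potency is given in log units (for example pKi), and the thresholds `sim_min=0.9` and `fold_min=10.0` are one convention chosen for the example, not values taken from this paper.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def is_activity_cliff(fp_a, fp_b, pki_a, pki_b, sim_min=0.9, fold_min=10.0):
    """Flag a pair that is structurally similar yet differs strongly in potency.

    Potencies are in log10 units, so the fold difference is 10 ** |delta|.
    Thresholds are illustrative assumptions.
    """
    fold = 10 ** abs(pki_a - pki_b)
    return tanimoto(fp_a, fp_b) >= sim_min and fold >= fold_min

# Two hypothetical molecules sharing 19 of 21 distinct fingerprint bits.
fp_a = set(range(20))
fp_b = set(range(19)) | {99}

print(is_activity_cliff(fp_a, fp_b, 8.2, 6.9))  # similar structure, large potency gap
print(is_activity_cliff(fp_a, fp_b, 8.2, 8.0))  # similar structure, similar potency
```

The first pair violates the similarity principle (near-identical structure, roughly 20-fold potency difference), while the second pair conforms to it.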

Innovation

Methods, ideas, or system contributions that make the work stand out.

semi-supervised learning

activity cliff

pseudo-labeling

instructor model

curriculum learning
