Descriptor-based Foundation Models for Molecular Property Prediction

📅 2025-06-18

🤖 AI Summary
To address the substantial experimental noise in molecular property data and the systematic biases of quantum mechanical simulations, this paper introduces CheMeleon, a molecular foundation model. CheMeleon is pre-trained with a directed message-passing neural network (D-MPNN) to predict deterministic Mordred descriptors, so its pretraining targets are free of experimental noise. Evaluated on 58 benchmark datasets from Polaris and MoleculeACE, it achieves a 79% win rate on Polaris, 43 percentage points above Chemprop (36%), and a 97% win rate on MoleculeACE. t-SNE projections of the learned representations show clear separation of chemical series. The results support descriptor-based pretraining as a route to accurate molecular modeling on small datasets. However, like many of the tested models, CheMeleon struggles to distinguish activity cliffs, where structurally similar molecules differ sharply in activity.
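The summary does not define how a "win" is counted. The sketch below is one plausible reading, not the paper's protocol: a model wins a dataset when its score matches the best score on that dataset, so tied models all receive a win. That assumption is at least consistent with the reported Polaris rates (79%, 46%, 39%, 36%) summing to more than 100%. The function name `win_rates`, the tolerance `tol`, and the toy scores are hypothetical.

```python
from typing import Dict, List


def win_rates(
    scores: Dict[str, List[float]],
    higher_is_better: bool = True,
    tol: float = 0.0,
) -> Dict[str, float]:
    """Fraction of datasets on which each model ties the best score.

    Assumption (not from the paper): a tie within `tol` counts as a win
    for every tied model, so rates may sum to more than 1.
    """
    models = list(scores)
    n_datasets = len(scores[models[0]])
    wins = {m: 0 for m in models}
    for i in range(n_datasets):
        column = {m: scores[m][i] for m in models}
        best = max(column.values()) if higher_is_better else min(column.values())
        for m, s in column.items():
            if abs(s - best) <= tol:
                wins[m] += 1
    return {m: wins[m] / n_datasets for m in models}


# Toy scores on four hypothetical datasets (not real benchmark results).
toy_scores = {
    "model_a": [0.90, 0.80, 0.70, 0.95],
    "model_b": [0.85, 0.82, 0.60, 0.95],
}
rates = win_rates(toy_scores)

# The gap quoted in the summary is a difference in percentage points.
gap_points = 79 - 36
```

In the toy data the two models tie on the last dataset, so both are credited and the rates sum to more than 1.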

📝 Abstract
Fast and accurate prediction of molecular properties with machine learning is pivotal to scientific advancements across myriad domains. Foundation models have proven especially effective, enabling accurate training on small, real-world datasets. This study introduces CheMeleon, a novel molecular foundation model pre-trained on deterministic molecular descriptors from the Mordred package, leveraging a Directed Message-Passing Neural Network to predict these descriptors in a noise-free setting. Unlike conventional approaches relying on noisy experimental data or biased quantum mechanical simulations, CheMeleon uses low-noise molecular descriptors to learn rich molecular representations. Evaluated on 58 benchmark datasets from Polaris and MoleculeACE, CheMeleon achieves a win rate of 79% on Polaris tasks, outperforming baselines like Random Forest (46%), fastprop (39%), and Chemprop (36%), and a 97% win rate on MoleculeACE assays, surpassing Random Forest (63%) and other foundation models. However, like many of the tested models, it struggles to distinguish activity cliffs. The t-SNE projection of CheMeleon's learned representations demonstrates effective separation of chemical series, highlighting its ability to capture structural nuances. These results underscore the potential of descriptor-based pre-training for scalable and effective molecular property prediction, opening avenues for further exploration of descriptor sets and unlabeled datasets.

Problem

Research questions and friction points this paper is trying to address.

Predict molecular properties accurately with low-noise descriptors
Overcome limitations of noisy experimental or biased simulation data
Improve foundation models for scalable molecular representation learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-trained on deterministic Mordred molecular descriptors
Uses Directed Message-Passing Neural Network
Achieves high win rates on benchmark datasets
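The page names the Directed Message-Passing Neural Network without describing its update. The toy sketch below illustrates the defining idea of directed message passing as commonly described for Chemprop-style models: hidden states live on directed bonds, and the message into bond v→w sums the incoming bonds k→v while skipping the reverse bond w→v. It is not CheMeleon's implementation. Scalar atom features and the fixed scalar weight `w_msg` stand in for learned weight matrices, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def directed_message_passing(
    atom_feats: List[float],
    bonds: List[Tuple[int, int]],
    steps: int = 1,
    w_msg: float = 0.5,
) -> float:
    """Toy D-MPNN-style pass returning a scalar molecule embedding.

    Assumptions: scalar features, ReLU activation, and a fixed scalar
    `w_msg` in place of learned weights.
    """
    def relu(x: float) -> float:
        return max(0.0, x)

    # Each undirected bond becomes two directed edges.
    edges: List[Tuple[int, int]] = []
    for a, b in bonds:
        edges.extend([(a, b), (b, a)])

    # incoming[v] lists the directed edges (k, v) that end at atom v.
    incoming: Dict[int, List[Tuple[int, int]]] = {
        v: [] for v in range(len(atom_feats))
    }
    for k, v in edges:
        incoming[v].append((k, v))

    # Initial hidden state of edge v->w comes from the source atom v.
    h0 = {(v, w): relu(atom_feats[v]) for v, w in edges}
    h = dict(h0)

    for _ in range(steps):
        new_h = {}
        for v, w in edges:
            # Sum the edges flowing into v, excluding the reverse edge w->v.
            msg = sum(h[(k, u)] for k, u in incoming[v] if k != w)
            new_h[(v, w)] = relu(h0[(v, w)] + w_msg * msg)
        h = new_h

    # Atom readout: own feature plus all incoming edge states, then sum pooling.
    atom_states = [
        relu(atom_feats[v] + sum(h[e] for e in incoming[v]))
        for v in range(len(atom_feats))
    ]
    return sum(atom_states)


# A three-atom chain 0-1-2 with made-up scalar features.
embedding = directed_message_passing([1.0, 2.0, 3.0], [(0, 1), (1, 2)], steps=1)
```

In a descriptor-based pretraining setup of the kind the abstract describes, an embedding like this would feed a regression head trained to predict the Mordred descriptor vector; that head is omitted here.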

Authors
Jackson Burns
Department of Chemical Engineering, MIT, Cambridge, MA.
Akshat Zalte
Department of Chemical Engineering, MIT, Cambridge, MA.
William Green
Professor of Mathematics, Rose-Hulman Institute of Technology
Dispersive PDEs