BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

📅 2025-05-03
🤖 AI Summary
Molecular property prediction models exhibit poor out-of-distribution (OOD) generalization and lack standardized, systematic evaluation benchmarks. To address this, we introduce BOOM, the first standardized molecular OOD benchmark, comprising over 140 model-task combinations across diverse chemical domains. Our unified evaluation framework rigorously assesses graph neural networks, chemically pretrained models, transfer-learning approaches, and multiple molecular representations (e.g., SMILES strings and molecular graphs), while analyzing the impact of data generation protocols, pretraining strategies, and model architectures on OOD performance. Key findings: OOD errors are on average three times higher than in-distribution (ID) errors; models with strong inductive biases excel on simple properties, whereas state-of-the-art chemical foundation models consistently fail at OOD extrapolation. We open-source the BOOM platform to enable reproducible benchmarking and establish robust OOD generalization as a critical frontier challenge for AI-driven chemistry.

📝 Abstract
Advances in deep learning and generative modeling have driven interest in data-driven molecule discovery pipelines, whereby machine learning (ML) models are used to filter and design novel molecules without requiring prohibitively expensive first-principles simulations. Although the discovery of novel molecules that extend the boundaries of known chemistry requires accurate out-of-distribution (OOD) predictions, ML models often struggle to generalize OOD. Furthermore, there are currently no systematic benchmarks for molecular OOD prediction tasks. We present BOOM (Benchmarks for Out-Of-distribution Molecular property predictions), a benchmark study of property-based OOD evaluations for common molecular property prediction models. We evaluate more than 140 combinations of models and property prediction tasks to benchmark deep learning models on their OOD performance. Overall, we do not find any existing model that achieves strong OOD generalization across all tasks: even the top-performing model exhibited an average OOD error 3x larger than its in-distribution error. We find that deep learning models with high inductive bias can perform well on OOD tasks with simple, specific properties. Although chemical foundation models with transfer and in-context learning offer a promising solution for limited-training-data scenarios, we find that current foundation models do not show strong OOD extrapolation capabilities. We perform extensive ablation experiments to highlight how OOD performance is impacted by data generation, pre-training, hyperparameter optimization, model architecture, and molecular representation. We propose that developing ML models with strong OOD generalization is a new frontier challenge in chemical ML model development. This open-source benchmark will be made available on GitHub.
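The property-based OOD evaluation described in the abstract can be sketched as a simple splitting scheme: hold out the molecules with the most extreme property values as the OOD test set and train on the rest. This is a minimal illustration of the idea, not the exact BOOM protocol; the function name and split fraction are hypothetical.

```python
# Hypothetical sketch of a property-based OOD split: samples whose
# property values fall in the extreme tails form the OOD test set,
# and the remainder are treated as in-distribution (ID).

def property_extrapolation_split(values, ood_fraction=0.1):
    """Return (id_indices, ood_indices), holding out roughly
    ood_fraction of samples with the most extreme property values,
    split evenly between the low and high tails."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    k = int(len(values) * ood_fraction / 2)  # samples held out per tail
    ood = set(order[:k] + order[len(order) - k:]) if k else set()
    id_idx = [i for i in range(len(values)) if i not in ood]
    return id_idx, sorted(ood)

# Toy usage: 20 synthetic property values 0..19 with a 20% OOD fraction
vals = list(range(20))
id_idx, ood_idx = property_extrapolation_split(vals, ood_fraction=0.2)
print(ood_idx)  # the two lowest and two highest values: [0, 1, 18, 19]
```

A split like this forces the model to extrapolate beyond the property range seen during training, which is what distinguishes OOD evaluation from a random train/test split.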
Problem

Research questions and friction points this paper is trying to address.

Benchmarking OOD molecular property predictions for ML models
Evaluating 140+ model-task combinations for OOD performance
Assessing chemical foundation models' OOD extrapolation capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

First standardized benchmark for OOD molecular property prediction
Unified evaluation of 140+ model-task combinations across diverse properties
Ablations isolating the effects of data generation, pre-training, and model architecture on OOD performance
Evan R. Antoniuk
Lawrence Livermore National Laboratory
Shehtab Zaman
Binghamton University, School of Computing
Tal Ben-Nun
Lawrence Livermore National Laboratory
Peggy Li
Lawrence Livermore National Laboratory
James Diffenderfer
University of Florida, Lawrence Livermore National Laboratory
Busra Demirci
Binghamton University, School of Computing
Obadiah Smolenski
Binghamton University, School of Computing
Tim Hsu
Lawrence Livermore National Laboratory
A. Hiszpanski
Lawrence Livermore National Laboratory
Kenneth Chiu
Binghamton University, School of Computing
B. Kailkhura
Lawrence Livermore National Laboratory
B. V. Essen
Lawrence Livermore National Laboratory