BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models

📅 2025-05-03
🤖 AI Summary
Molecular property prediction models exhibit poor out-of-distribution (OOD) generalization and lack standardized, systematic evaluation benchmarks. To address this, we introduce BOOM, the first standardized molecular OOD benchmark, comprising over 140 model-task combinations across diverse chemical domains. Our unified evaluation framework rigorously assesses graph neural networks, chemically pretrained models, transfer-learning approaches, and multiple molecular representations (e.g., SMILES strings and molecular graphs), while analyzing the impact of data generation protocols, pretraining strategies, and model architectures on OOD performance. Key findings: OOD errors are on average three times higher than in-distribution (ID) errors; models with strong inductive biases excel on simple properties, whereas state-of-the-art chemical foundation models consistently fail at OOD extrapolation. We open-source the BOOM platform to enable reproducible benchmarking and establish robust OOD generalization as a critical frontier challenge for AI-driven chemistry.

📝 Abstract
Advances in deep learning and generative modeling have driven interest in data-driven molecule discovery pipelines, whereby machine learning (ML) models are used to filter and design novel molecules without requiring prohibitively expensive first-principles simulations. Although the discovery of novel molecules that extend the boundaries of known chemistry requires accurate out-of-distribution (OOD) predictions, ML models often struggle to generalize OOD. Furthermore, there are currently no systematic benchmarks for molecular OOD prediction tasks. We present BOOM (Benchmarks for Out-Of-distribution Molecular property predictions), a benchmark study of property-based OOD evaluations for common molecular property prediction models. We evaluate more than 140 combinations of models and property prediction tasks to benchmark deep learning models on their OOD performance. Overall, we do not find any existing model that achieves strong OOD generalization across all tasks: even the top-performing model exhibited an average OOD error 3x larger than its in-distribution error. We find that deep learning models with high inductive bias can perform well on OOD tasks with simple, specific properties. Although chemical foundation models with transfer and in-context learning offer a promising solution for limited-training-data scenarios, we find that current foundation models do not show strong OOD extrapolation capabilities. We perform extensive ablation experiments to highlight how OOD performance is impacted by data generation, pre-training, hyperparameter optimization, model architecture, and molecular representation. We propose that developing ML models with strong OOD generalization is a new frontier challenge in chemical ML model development. This open-source benchmark will be made available on GitHub.
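The property-based OOD evaluation described in the abstract can be sketched as a simple splitting scheme: hold out the molecules with the most extreme property values as the OOD test set and train on the rest. This is a minimal illustration of the idea, not the exact BOOM protocol; the function name and split fraction are hypothetical.

```python
# Hypothetical sketch of a property-based OOD split: samples whose
# property values fall in the extreme tails form the OOD test set,
# and the remainder are treated as in-distribution (ID).

def property_extrapolation_split(values, ood_fraction=0.1):
    """Return (id_indices, ood_indices), holding out roughly
    ood_fraction of samples with the most extreme property values,
    split evenly between the low and high tails."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    k = int(len(values) * ood_fraction / 2)  # samples held out per tail
    ood = set(order[:k] + order[len(order) - k:]) if k else set()
    id_idx = [i for i in range(len(values)) if i not in ood]
    return id_idx, sorted(ood)

# Toy usage: 20 synthetic property values 0..19 with a 20% OOD fraction
vals = list(range(20))
id_idx, ood_idx = property_extrapolation_split(vals, ood_fraction=0.2)
print(ood_idx)  # the two lowest and two highest values: [0, 1, 18, 19]
```

A split like this forces the model to extrapolate beyond the property range seen during training, which is what distinguishes OOD evaluation from a random train/test split.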
Problem

Research questions and friction points this paper is trying to address.

Benchmarking OOD molecular property predictions for ML models
Evaluating 140+ model-task combinations for OOD performance
Assessing chemical foundation models' OOD extrapolation capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

First standardized benchmark for OOD molecular property prediction
Unified evaluation of 140+ model-task combinations across diverse properties
Ablations isolating the effects of data generation, pre-training, and model architecture on OOD performance
Evan R. Antoniuk
Lawrence Livermore National Laboratory
Shehtab Zaman
Binghamton University, School of Computing
Tal Ben-Nun
Lawrence Livermore National Laboratory
Peggy Li
Lawrence Livermore National Laboratory
James Diffenderfer
University of Florida, Lawrence Livermore National Laboratory
Busra Demirci
Binghamton University, School of Computing
Obadiah Smolenski
Binghamton University, School of Computing
Tim Hsu
Lawrence Livermore National Laboratory
A. Hiszpanski
Lawrence Livermore National Laboratory
Kenneth Chiu
Binghamton University, School of Computing
B. Kailkhura
Lawrence Livermore National Laboratory
B. V. Essen
Lawrence Livermore National Laboratory