CheMixHub: Datasets and Benchmarks for Chemical Mixture Property Prediction

📅 2025-06-13
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
Problem: The lack of systematic benchmarks and standardized evaluation protocols hinders rigorous assessment of machine learning models for predicting properties of chemical mixtures. Method: This work introduces CheMixHub, presented as the first comprehensive benchmark platform for mixture property prediction, encompassing 11 tasks (e.g., drug delivery, electrolyte design) and approximately 500k curated data points. It proposes a three-level data partitioning scheme (by component, formulation, and task) to evaluate context-specific generalization. A unified baseline framework is developed, integrating graph neural networks with set-based representations, alongside a standardized evaluation protocol. Contribution/Results: CheMixHub addresses a gap in ML-driven modeling of multi-molecular systems and improves model comparability and robustness assessment. All data and code are publicly released to support community-wide standardization and reproducible research.
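
To illustrate what a set-based mixture representation can look like, here is a minimal sketch in plain Python. It is not taken from the CheMixHub code: the function name `pool_mixture`, the two-dimensional toy embeddings, and the choice of mole-fraction-weighted mean pooling are all assumptions made for illustration. In practice the per-component vectors might come from a graph neural network.

```python
from typing import List, Sequence


def pool_mixture(
    embeddings: Sequence[Sequence[float]],
    fractions: Sequence[float],
) -> List[float]:
    """Combine component embeddings into one mixture vector.

    Uses a fraction-weighted mean, which does not depend on the order
    in which the components are listed (a hypothetical pooling choice).
    """
    if not embeddings or len(embeddings) != len(fractions):
        raise ValueError("need exactly one fraction per component")
    total = sum(fractions)
    if total <= 0:
        raise ValueError("fractions must sum to a positive value")

    dim = len(embeddings[0])
    pooled = [0.0] * dim
    for vec, frac in zip(embeddings, fractions):
        weight = frac / total  # normalize so the weights sum to 1
        for i in range(dim):
            pooled[i] += weight * vec[i]
    return pooled


# Toy 2-d embeddings standing in for learned molecular representations.
water = [1.0, 0.0]
ethanol = [0.0, 1.0]

mix_a = pool_mixture([water, ethanol], [0.75, 0.25])
mix_b = pool_mixture([ethanol, water], [0.25, 0.75])  # same mixture, reordered
print(mix_a, mix_b)
```

Because the pooling is a weighted sum, listing the same components in a different order yields the same mixture vector, which is the property that motivates treating a mixture as a set rather than a sequence.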

📝 Abstract
Developing improved predictive models for multi-molecular systems is crucial, as nearly every chemical product in use results from a mixture of chemicals. While being a vital part of the industry pipeline, the chemical mixture space remains relatively unexplored by the Machine Learning community. In this paper, we introduce CheMixHub, a holistic benchmark for molecular mixtures, covering a corpus of 11 chemical mixture property prediction tasks, from drug delivery formulations to battery electrolytes, totalling approximately 500k data points gathered and curated from 7 publicly available datasets. CheMixHub introduces various data splitting techniques to assess context-specific generalization and model robustness, providing a foundation for the development of predictive models for chemical mixture properties. Furthermore, we map out the modelling space of deep learning models for chemical mixtures, establishing initial benchmarks for the community. This dataset has the potential to accelerate chemical mixture development, encompassing reformulation, optimization, and discovery. The dataset and code for the benchmarks can be found at: https://github.com/chemcognition-lab/chemixhub
Problem

Research questions and friction points this paper is trying to address.

Predict properties of chemical mixtures using ML models
Address lack of benchmarks for multi-molecular system prediction
Enable reformulation and discovery in industrial chemical applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CheMixHub benchmark for molecular mixtures
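Includes approximately 500k data points curated from 7 publicly available datasets
Assesses context-specific generalization and model robustness through multiple data splitting techniques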
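
The difference between a random split and a split that holds out whole components can be sketched as follows. This is an illustrative example only, not the benchmark's implementation: the function names, the toy component identifiers `"A"` to `"D"`, and the rule "a mixture goes to the test set if it contains any held-out component" are assumptions.

```python
import random
from typing import FrozenSet, List, Sequence, Set, Tuple

Mixture = FrozenSet[str]  # a mixture as a set of component identifiers


def random_split(
    mixtures: Sequence[Mixture], test_frac: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle mixture indices and cut off a test fraction."""
    idx = list(range(len(mixtures)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(test_frac * len(idx)))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def leave_components_out_split(
    mixtures: Sequence[Mixture], held_out: Set[str]
) -> Tuple[List[int], List[int]]:
    """Send every mixture containing a held-out component to the test set."""
    train: List[int] = []
    test: List[int] = []
    for i, mix in enumerate(mixtures):
        (test if mix & held_out else train).append(i)
    return train, test


# Hypothetical toy data: five binary mixtures over components A-D.
mixtures = [
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
    frozenset({"C", "D"}),
    frozenset({"A", "D"}),
]

rand_train, rand_test = random_split(mixtures, test_frac=0.4)
comp_train, comp_test = leave_components_out_split(mixtures, held_out={"D"})
print("random split:", rand_train, rand_test)
print("component split:", comp_train, comp_test)
```

Under the component-level split, component `"D"` never appears in training, so test performance reflects generalization to an unseen chemical. A random split can place `"D"` on both sides, which generally makes the task easier.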
Authors
Ella Miray Rajaonson
University of Toronto, Canada; Vector Institute for Artificial Intelligence, Canada
Mahyar Rajabi Kochi
University of Toronto, Canada
Luis Martin Mejia Mendoza
Clean Energy Innovation Research Center, National Research Council, Canada
Seyed Mohamad Moosavi
University of Toronto, Canada; Vector Institute for Artificial Intelligence, Canada
Benjamin Sanchez-Lengeling
Assistant Professor at University of Toronto
Computational Chemistry, Machine Learning, Materials, Generative models, Making sense of models