Are we Making Much Progress? Revisiting Chemical Reaction Yield Prediction from an Imbalanced Regression Perspective

📅 2024-02-06

🏛️ The Web Conference

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Problem: Yield prediction models are least accurate on high-yield reactions (>80%), the range chemists care about most. Real-world yield data are skewed toward low yields, so low-yield instances dominate training and high-yield reactions are underrepresented. Method: This work reframes reaction yield prediction as an imbalanced regression problem and applies simple cost-sensitive re-weighting, which raises the loss weight of samples from underrepresented yield ranges without modifying the underlying model architecture or regression head. Contribution/Results: Evaluated on three real-world yield datasets, the method reportedly reduces mean absolute error in the high-yield region by 27%. Because it changes only the training loss, it can be plugged into existing yield predictors.
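
To make the high-yield metric concrete, the following minimal Python sketch computes mean absolute error (MAE) separately for reactions above and below a yield threshold. The 80% cut-off follows the summary above; the function names and the toy numbers are illustrative assumptions, not taken from the paper.

```python
def mae(pairs):
    """Mean absolute error over (true, predicted) pairs; NaN if empty."""
    if not pairs:
        return float("nan")
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


def mae_by_region(y_true, y_pred, threshold=80.0):
    """Report MAE overall and split at a yield threshold (in percent)."""
    pairs = list(zip(y_true, y_pred))
    high = [(t, p) for t, p in pairs if t > threshold]
    low = [(t, p) for t, p in pairs if t <= threshold]
    return {"overall": mae(pairs), "high": mae(high), "low": mae(low)}


# Toy data (hypothetical): four low-yield reactions and one high-yield one.
y_true = [5.0, 10.0, 20.0, 30.0, 90.0]
y_pred = [7.0, 12.0, 18.0, 33.0, 70.0]
report = mae_by_region(y_true, y_pred)
print(report)
```

In this toy case the overall MAE looks moderate only because the four low-yield reactions outnumber the single high-yield one, whose error is much larger. This is the kind of gap a single aggregate metric can hide.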

📝 Abstract

The yield of a chemical reaction quantifies the percentage of the target product formed in relation to the reactants consumed during the chemical reaction. Accurate yield prediction can guide chemists toward selecting high-yield reactions during synthesis planning, offering valuable insights before dedicating time and resources to wet lab experiments. While recent advancements in yield prediction have led to overall performance improvement across the entire yield range, an open challenge remains in enhancing predictions for high-yield reactions, which are of greater concern to chemists. In this paper, we argue that the performance gap in high-yield predictions results from the imbalanced distribution of real-world data skewed towards low-yield reactions, often due to unreacted starting materials and inherent ambiguities in the reaction processes. Despite this data imbalance, existing yield prediction methods continue to treat different yield ranges equally, assuming a balanced training distribution. Through extensive experiments on three real-world yield prediction datasets, we emphasize the urgent need to reframe reaction yield prediction as an imbalanced regression problem. Finally, we demonstrate that incorporating simple cost-sensitive re-weighting methods can significantly enhance the performance of yield prediction models on underrepresented high-yield regions.
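
One simple form of cost-sensitive re-weighting is inverse-frequency weighting over yield bins. The sketch below is an assumption about what such a scheme can look like, not the authors' exact method; the bin width, the function names, and the toy data are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(yields, bin_width=10.0):
    """Weight each sample by 1 / (count of samples in its yield bin).

    Yields are percentages in [0, 100]. Weights are rescaled to mean 1
    so the overall loss magnitude stays comparable to the unweighted case.
    """
    n_bins = int(100 / bin_width)
    # A yield of exactly 100 is clamped into the last bin.
    bins = [min(int(y // bin_width), n_bins - 1) for y in yields]
    counts = Counter(bins)
    raw = [1.0 / counts[b] for b in bins]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def weighted_mae(y_true, y_pred, weights):
    """Weighted mean absolute error, usable as a training loss value."""
    total = sum(w * abs(t - p) for t, p, w in zip(y_true, y_pred, weights))
    return total / sum(weights)


# Toy data (hypothetical): low yields dominate, one high-yield reaction.
y_true = [5.0, 8.0, 12.0, 15.0, 95.0]
y_pred = [6.0, 9.0, 10.0, 17.0, 75.0]

weights = inverse_frequency_weights(y_true)
plain = weighted_mae(y_true, y_pred, [1.0] * len(y_true))
reweighted = weighted_mae(y_true, y_pred, weights)
print(weights, plain, reweighted)
```

The rare high-yield sample receives a larger weight than the crowded low-yield ones, so its large error contributes more to the re-weighted loss than to the plain MAE. The underlying model is untouched; only the per-sample loss weights change.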

Problem

Research questions and friction points this paper is trying to address.

Improving high-yield reaction prediction

Addressing imbalanced data in yield prediction

Enhancing model performance with cost-sensitive methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Imbalanced regression problem reframing

Cost-sensitive re-weighting methods

Enhanced high-yield reaction prediction

🔎 Similar Papers

An Autonomous Large Language Model Agent for Chemical Literature Data Mining