CompleteRXN: Toward Completing Open Chemical Reaction Databases

📅 2026-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses pervasive incompleteness in chemical reaction databases such as USPTO, whose records frequently omit byproducts, co-reactants, and stoichiometric coefficients. It introduces CompleteRXN, a large-scale supervised benchmark for reaction completion under realistic missing-data conditions. The benchmark is built by mapping raw USPTO records to curated, atom-balanced mechanistic reactions, which yields aligned pairs of incomplete and complete reactions. On this benchmark the authors evaluate several baselines, including their own Constrained Reaction Balancer (CRB), a Transformer encoder-decoder model whose constrained decoding enforces chemical plausibility and atom conservation, and SynRBL, a recent algorithmic method. CRB reaches equivalence accuracies of 99.20% on the random split and 91.12% on the extreme out-of-distribution split. SynRBL produces many balanced, chemically plausible completions but scores lower on the benchmark test splits. Performance degrades for all methods as incompleteness increases, and it drops substantially on reactions outside the benchmark (the full uncurated USPTO), which exposes a gap between benchmark performance and practical robustness.
📝 Abstract
Chemical reaction datasets such as USPTO suffer from substantial incompleteness, frequently missing byproducts, co-reactants, and stoichiometric coefficients. This limits their applicability and reliability in downstream applications. Here, we introduce CompleteRXN, a large-scale supervised benchmark for reaction completion under realistic missing-data conditions. We construct a dataset of aligned incomplete and atom-balanced reactions by mapping USPTO records to curated mechanistic reactions. We evaluate representative baselines, including a novel encoder-decoder reaction completion model with constrained decoding, the Constrained Reaction Balancer (CRB), and a recent algorithmic method, SynRBL. On our CompleteRXN benchmark, the CRB achieves high performance across splits of increasing difficulty, reaching 99.20% equivalence accuracy on the random split and 91.12% on the extreme out-of-distribution split. SynRBL produces many balanced and chemically plausible completions, but with lower accuracy on the benchmark test splits. Across all methods, performance degrades with increasing incompleteness. We observe a substantial drop when evaluating on reactions outside the benchmark (full uncurated USPTO), highlighting the gap between benchmark performance and practical robustness and motivating future work.
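To make "atom-balanced" concrete, here is a minimal Python sketch of an atom-balance check on a reaction record. It is an illustration only, not the paper's pipeline: the helper names (`parse_formula`, `side_atoms`, `atom_deficit`, `is_balanced`) are hypothetical, reactions are given as plain molecular formulas rather than the SMILES strings used in USPTO, and the parser assumes flat formulas without parentheses or charges. The esterification example is chosen by us to show a record that is missing a byproduct (water).

```python
import re
from collections import Counter

# Matches an element symbol followed by an optional count, e.g. "C2", "H", "Cl3".
FORMULA_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat molecular formula (no parentheses, no charges)."""
    counts = Counter()
    for element, digits in FORMULA_TOKEN.findall(formula):
        counts[element] += int(digits) if digits else 1
    return counts

def side_atoms(side) -> Counter:
    """Total atom counts for one side, given (coefficient, formula) pairs."""
    total = Counter()
    for coeff, formula in side:
        for element, n in parse_formula(formula).items():
            total[element] += coeff * n
    return total

def atom_deficit(reactants, products):
    """Return (atoms missing from products, atoms missing from reactants)."""
    left, right = side_atoms(reactants), side_atoms(products)
    # Counter subtraction keeps only positive differences.
    return left - right, right - left

def is_balanced(reactants, products) -> bool:
    return side_atoms(reactants) == side_atoms(products)

# Acetic acid + ethanol -> ethyl acetate, recorded without the water byproduct.
reactants = [(1, "C2H4O2"), (1, "C2H6O")]
incomplete_products = [(1, "C4H8O2")]
complete_products = [(1, "C4H8O2"), (1, "H2O")]

missing_from_products, missing_from_reactants = atom_deficit(reactants, incomplete_products)
print(is_balanced(reactants, incomplete_products))  # False
print(dict(missing_from_products))                  # {'H': 2, 'O': 1}
print(is_balanced(reactants, complete_products))    # True
```

The deficit of two hydrogens and one oxygen is exactly what a completion method has to account for. A balance check like this cannot say which species supplies those atoms, which is why the benchmark pairs each incomplete record with a curated complete reaction.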
Problem

Research questions and friction points this paper is trying to address.

chemical reaction completion
reaction database incompleteness
stoichiometric coefficients
byproducts
co-reactants
Innovation

Methods, ideas, or system contributions that make the work stand out.

reaction completion
atom balancing
constrained decoding
chemical reaction dataset
out-of-distribution generalization
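
The interplay of constrained decoding and atom balancing can be illustrated with a toy Python sketch. This is not CRB's actual decoding procedure, which the summary does not specify. Here the "model" is a hard-coded score table, the vocabulary (`VOCAB`) consists of whole species rather than SMILES tokens, and the function names are hypothetical. The only point is the mechanism: candidates that would violate atom conservation are masked before the highest-scoring one is picked.

```python
from collections import Counter

# Hypothetical vocabulary: candidate species and their atom counts.
VOCAB = {
    "H2O": Counter({"H": 2, "O": 1}),
    "HCl": Counter({"H": 1, "Cl": 1}),
    "CO2": Counter({"C": 1, "O": 2}),
    "H2": Counter({"H": 2}),
}

def fits(candidate: Counter, budget: Counter) -> bool:
    """True if the candidate uses no more atoms than the budget still holds."""
    return all(budget[element] >= n for element, n in candidate.items())

def constrained_greedy_decode(scores, budget, max_steps=8):
    """Greedily pick species, masking any that exceed the remaining atom budget.

    scores: stand-in for model scores, mapping species name -> float.
    budget: atoms still unaccounted for on the product side.
    Returns (chosen species, atoms left unassigned).
    """
    remaining = Counter(budget)
    chosen = []
    while sum(remaining.values()) > 0 and len(chosen) < max_steps:
        allowed = [name for name in VOCAB if fits(VOCAB[name], remaining)]
        if not allowed:
            break  # no candidate can be added without breaking conservation
        best = max(allowed, key=scores.get)
        chosen.append(best)
        remaining.subtract(VOCAB[best])
    return chosen, +remaining  # unary plus drops zero counts

# Made-up scores in which the unconstrained favorite (HCl) is infeasible.
scores = {"HCl": 0.9, "H2O": 0.6, "CO2": 0.3, "H2": 0.2}
budget = Counter({"H": 2, "O": 1})

chosen, leftover = constrained_greedy_decode(scores, budget)
print(chosen)          # ['H2O']
print(dict(leftover))  # {}
```

Without the mask, the top-scoring candidate would be HCl, which introduces a chlorine atom that the reaction cannot supply. With the mask, decoding can only emit completions that stay within the atom budget, at the cost of being limited to whatever the vocabulary can express.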