🤖 AI Summary
Generative models for transition state (TS) prediction generalize poorly to large molecules because of distribution shift and the scarcity of large-scale training data. This work proposes FragmentFlow, a divide-and-conquer strategy that trains a generative model to predict only the TS geometry of the reactive core, whose size and composition remain relatively constant across molecular contexts. The full TS is then reconstructed by re-attaching substituent fragments to the predicted core and refined with a saddle-point optimization algorithm. Operating on the core mitigates the generalization problems posed by large molecular systems and points toward scalable TS generation. Evaluated on a newly curated dataset of reactions whose reactants contain up to 33 heavy atoms, the method correctly identifies 90% of TSs and requires 30% fewer saddle-point optimization steps than classical initialization schemes.
📝 Abstract
Transition states (TSs) are central to understanding and quantitatively predicting chemical reactivity and reaction mechanisms. Although traditional TS generation methods are computationally expensive, recent generative modeling approaches have enabled chemically meaningful TS prediction for relatively small molecules. However, these methods fail to generalize to practically relevant reaction substrates because of distribution shifts induced by increasing molecular sizes. Furthermore, TS geometries for larger molecules are not available at scale, making it infeasible to train generative models from scratch on such molecules. To address these challenges, we introduce FragmentFlow: a divide-and-conquer approach that trains a generative model to predict TS geometries for the reactive core atoms, which define the reaction mechanism. The full TS structure is then reconstructed by re-attaching substituent fragments to the predicted core. By operating on reactive cores, whose size and composition remain relatively invariant across molecular contexts, FragmentFlow mitigates distribution shifts in generative modeling. Evaluated on a new curated dataset of reactions involving reactants with up to 33 heavy atoms, FragmentFlow correctly identifies 90% of TSs while requiring 30% fewer saddle-point optimization steps than classical initialization schemes. These results point toward scalable TS generation for high-throughput reactivity studies.
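The abstract does not say how substituent fragments are re-attached to the predicted core. One common way to place a rigid fragment is to superimpose a few shared anchor atoms with the Kabsch algorithm and apply the resulting rigid transform to the fragment. The sketch below illustrates that idea only; it is an assumption, not FragmentFlow's implementation, and the names `kabsch` and `attach_fragment`, as well as the choice of anchor atoms, are hypothetical.

```python
import numpy as np


def kabsch(P, Q):
    """Best-fit proper rotation R and translation t such that P @ R.T + t ~ Q.

    P, Q: (n, 3) arrays of corresponding points, n >= 3 and not collinear.
    """
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    # Covariance between the centered point sets.
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t


def attach_fragment(frag_xyz, anchors_ref, anchors_ts):
    """Move a rigid substituent from the reference frame into the TS frame.

    frag_xyz:    (m, 3) substituent atom coordinates in the reference
                 (e.g. reactant) geometry.
    anchors_ref: (n, 3) coordinates of anchor atoms (hypothetically, core atoms
                 near the attachment point) in the same reference geometry.
    anchors_ts:  (n, 3) coordinates of those anchor atoms in the predicted
                 TS core.
    """
    R, t = kabsch(anchors_ref, anchors_ts)
    return frag_xyz @ R.T + t
```

In this sketch each substituent is treated as a rigid body, so any strain introduced at the junction would be left to the subsequent saddle-point optimization to relax.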