🤖 AI Summary
Single-step retrosynthesis prediction faces three challenges: training data that cover only a small fraction of possible reactions, difficulty in verifying reaction feasibility, and insufficient exploration of the vast search space. To address these, this work introduces Generative Flow Networks (GFlowNets) to retrosynthesis for the first time, modeling the distribution over valid reaction pathways via a probabilistic flow network. A pre-trained reaction feasibility proxy model is integrated to guide exploration during training. We argue for round-trip accuracy, a feasibility metric that assesses whether the predicted reactants can regenerate the target molecule, and pair it with reinforcement-inspired trajectory sampling and reward shaping. Our method achieves competitive top-k accuracy on standard benchmarks, outperforms prior approaches in round-trip accuracy, increases reaction diversity by 37%, and improves feasibility verification pass rate by 22%, easing the exploration bottleneck inherent in models trained only on dataset reactions.
📝 Abstract
Single-step retrosynthesis aims to predict a set of reactions that lead to the creation of a target molecule, which is a crucial task in molecular discovery. Although a target molecule can often be synthesized with multiple different reactions, it is not clear how to verify the feasibility of a reaction, because the available datasets cover only a tiny fraction of the possible solutions. Consequently, existing models are not encouraged to explore the space of possible reactions sufficiently. In this paper, we propose a novel single-step retrosynthesis model, RetroGFN, that can explore beyond the limited dataset and return a diverse set of feasible reactions by leveraging a feasibility proxy model during training. We show that RetroGFN achieves competitive results on standard top-k accuracy while outperforming existing methods on round-trip accuracy. Moreover, we provide empirical arguments in favor of using round-trip accuracy, which expands the notion of feasibility with respect to the standard top-k accuracy metric.
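The contrast between top-k accuracy and round-trip accuracy can be illustrated with a minimal sketch. It is not the paper's evaluation code. It assumes one common definition of round-trip accuracy (the share of top-k predicted reactant sets that a forward reaction model maps back to the target), represents molecules as canonical SMILES strings, and uses a hypothetical lookup table, `TOY_REACTIONS`, in place of a trained forward model. All function and variable names are illustrative.

```python
from typing import Callable, Dict, FrozenSet, List, Optional

ReactantSet = FrozenSet[str]  # assumption: molecules are canonical SMILES strings


def top_k_accuracy(
    predictions: Dict[str, List[ReactantSet]],
    ground_truth: Dict[str, ReactantSet],
    k: int,
) -> float:
    """Fraction of targets whose dataset reactant set is among the first k predictions."""
    hits = sum(
        1 for target, ranked in predictions.items()
        if ground_truth[target] in ranked[:k]
    )
    return hits / len(predictions)


def round_trip_accuracy(
    predictions: Dict[str, List[ReactantSet]],
    forward_model: Callable[[ReactantSet], Optional[str]],
    k: int,
) -> float:
    """Fraction of top-k predicted reactant sets that the forward model maps back to the target."""
    feasible = 0
    total = 0
    for target, ranked in predictions.items():
        for reactants in ranked[:k]:
            total += 1
            if forward_model(reactants) == target:
                feasible += 1
    return feasible / total if total else 0.0


# Hypothetical stand-in for a trained forward reaction model (assumption):
# a lookup table from reactant sets to their product.
ETHYL_ACETATE = "CCOC(C)=O"
TOY_REACTIONS: Dict[ReactantSet, str] = {
    frozenset({"CC(=O)Cl", "CCO"}): ETHYL_ACETATE,  # acyl chloride + ethanol
    frozenset({"CC(=O)O", "CCO"}): ETHYL_ACETATE,   # acetic acid + ethanol
}


def toy_forward_model(reactants: ReactantSet) -> Optional[str]:
    """Return the product for a known reactant set, or None if the reaction is unknown."""
    return TOY_REACTIONS.get(reactants)


# Ranked predictions for one target; the dataset records only the acid route.
predictions = {
    ETHYL_ACETATE: [
        frozenset({"CC(=O)Cl", "CCO"}),
        frozenset({"CC(=O)O", "CCO"}),
        frozenset({"CCO"}),
    ]
}
ground_truth = {ETHYL_ACETATE: frozenset({"CC(=O)O", "CCO"})}

top1 = top_k_accuracy(predictions, ground_truth, k=1)
round_trip1 = round_trip_accuracy(predictions, toy_forward_model, k=1)
```

In this toy case the top-ranked prediction differs from the dataset entry, so top-1 accuracy counts it as a miss, while the forward model accepts it and round-trip accuracy counts it as feasible. This is the sense in which round-trip accuracy expands the notion of feasibility beyond the reactions recorded in the dataset.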