DiffER: Categorical Diffusion for Chemical Retrosynthesis

📅 2025-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the slow sequence generation and error accumulation inherent in autoregressive models for template-free retrosynthesis prediction, this paper proposes a categorical diffusion framework tailored to discrete SMILES sequences. Methodologically: (1) it generates all reactant tokens jointly through a diffusion process, removing the autoregressive dependency between output positions; (2) it adds a variance-aware sequence length prediction module that explicitly models uncertainty in SMILES length; and (3) it approximates the posterior distribution over reactants with a multi-model ensemble to improve sampling confidence and structural diversity. Experiments report state-of-the-art top-1 accuracy among template-free methods, with competitive top-3, top-5, and top-10 accuracy. The analyses indicate that the accuracy of length prediction strongly affects overall performance.
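To make the "categorical diffusion over SMILES tokens" idea concrete, the forward (noising) step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes uniform categorical noise (each token is kept with probability keep_prob and otherwise resampled uniformly from the vocabulary), a toy character-level vocabulary, and a hypothetical helper name corrupt_tokens. The summary does not specify DiffER's actual noise schedule, transition matrix, or tokenizer.

```python
import random

def corrupt_tokens(tokens, vocab, keep_prob, rng):
    """Draw x_t from q(x_t | x_0) under uniform categorical noise (an assumption).

    Each position is corrupted independently: the original token survives with
    probability keep_prob, otherwise it is replaced by a uniform draw from vocab
    (which may happen to equal the original token).
    """
    noisy = []
    for tok in tokens:
        if rng.random() < keep_prob:
            noisy.append(tok)                # token survives this noise level
        else:
            noisy.append(rng.choice(vocab))  # token resampled from the vocabulary
    return noisy

# Toy character-level vocabulary; real SMILES tokenizers also handle
# multi-character tokens such as "Cl" or "Br".
vocab = ["C", "N", "O", "(", ")", "=", "1"]
x0 = list("CC(=O)O")  # acetic acid, one character per token

rng = random.Random(0)
for keep_prob in (1.0, 0.5, 0.0):
    print(keep_prob, "".join(corrupt_tokens(x0, vocab, keep_prob, rng)))
```

Because every position is corrupted independently, a denoising network trained to invert this process can predict all positions of the output at once, which is the property the summary contrasts with left-to-right autoregressive decoding.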

📝 Abstract

Methods for automatic chemical retrosynthesis have found recent success through the application of models traditionally built for natural language processing, primarily through transformer neural networks. These models have demonstrated significant ability to translate between the SMILES encodings of chemical products and reactants, but are constrained as a result of their autoregressive nature. We propose DiffER, an alternative template-free method for retrosynthesis prediction in the form of categorical diffusion, which allows the entire output SMILES sequence to be predicted in unison. We construct an ensemble of diffusion models which achieves state-of-the-art performance for top-1 accuracy and competitive performance for top-3, top-5, and top-10 accuracy among template-free methods. We show that DiffER is a strong baseline for a new class of template-free model, capable of learning a variety of synthetic techniques used in laboratory settings and outperforming a variety of other template-free methods on top-k accuracy metrics. By constructing an ensemble of categorical diffusion models with a novel length prediction component with variance, our method is able to approximately sample from the posterior distribution of reactants, producing results with strong metrics of confidence and likelihood. Furthermore, our analyses demonstrate that accurate prediction of the SMILES sequence length is key to further boosting the performance of categorical diffusion models.
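One plausible way to turn samples from an ensemble into a ranked top-k list is to pool the samples and rank candidates by how often they occur. The sketch below assumes exactly that; the function name rank_candidates, the frequency-based score, and the example outputs are hypothetical, and the abstract does not state how DiffER scores or ranks its candidates.

```python
from collections import Counter

def rank_candidates(samples_per_model, k):
    """Pool sampled reactant strings from all ensemble members and rank by frequency.

    Returns up to k (smiles, share) pairs, where share is the fraction of all
    pooled samples equal to that string. Frequency is used here as a rough
    stand-in for confidence; this is an assumption, not the paper's scoring rule.
    """
    counts = Counter()
    for samples in samples_per_model:
        counts.update(samples)
    total = sum(counts.values())
    return [(smiles, n / total) for smiles, n in counts.most_common(k)]

# Made-up samples from three hypothetical ensemble members for one product.
samples_per_model = [
    ["CC(=O)O.CCO", "CC(=O)Cl.CCO", "CC(=O)O.CCO"],
    ["CC(=O)O.CCO", "CC(=O)OC(C)=O.CCO"],
    ["CC(=O)Cl.CCO", "CC(=O)O.CCO"],
]

for smiles, share in rank_candidates(samples_per_model, k=3):
    print(f"{share:.2f}  {smiles}")
```

In practice the strings would need to be canonicalized before counting, since one molecule can be written as many different SMILES strings; that step is omitted here.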

Problem

Research questions and friction points this paper is trying to address.

Proposing DiffER for template-free chemical retrosynthesis prediction

Overcoming autoregressive constraints in SMILES sequence prediction

Achieving state-of-the-art accuracy with categorical diffusion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Categorical diffusion for retrosynthesis prediction

Ensemble of diffusion models boosts accuracy

Novel length prediction enhances SMILES generation
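
Unlike an autoregressive decoder, which stops when it emits an end token, a diffusion model that denoises all positions at once needs the output length fixed before sampling begins. The sketch below shows one way a length predictor "with variance" could feed the sampler: draw several candidate lengths from a normal distribution around the predicted mean. The Gaussian form, the clipping bounds, and the name sample_lengths are assumptions made for illustration; the page does not describe the actual parameterization.

```python
import random

def sample_lengths(mean, std, n, min_len, max_len, rng):
    """Draw n candidate output lengths from N(mean, std^2), rounded and clipped.

    mean and std stand in for the outputs of a length-prediction module
    (hypothetical here); each draw fixes the number of token positions
    the diffusion sampler would denoise.
    """
    lengths = []
    for _ in range(n):
        draw = round(rng.gauss(mean, std))
        lengths.append(max(min_len, min(max_len, draw)))
    return lengths

rng = random.Random(0)
# A confident prediction (small std) concentrates samples near the mean;
# an uncertain one (large std) spreads them over more candidate lengths.
print(sample_lengths(mean=42.0, std=1.0, n=5, min_len=1, max_len=128, rng=rng))
print(sample_lengths(mean=42.0, std=6.0, n=5, min_len=1, max_len=128, rng=rng))
```

Sampling several lengths instead of committing to a single point estimate gives the sampler a chance to recover when the mean prediction is off by a few tokens, which is consistent with the reported sensitivity of accuracy to length prediction.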
