DiffER: Categorical Diffusion for Chemical Retrosynthesis

📅 2025-05-29
🤖 AI Summary
To address the low sequence generation efficiency and error accumulation inherent in autoregressive models for template-free retrosynthetic prediction, this paper proposes the first categorical diffusion framework tailored to SMILES discrete sequences. Methodologically: (1) it introduces an end-to-end joint reactant generation mechanism grounded in a diffusion process, eliminating autoregressive dependencies; (2) it incorporates a variance-aware sequence length prediction module to explicitly model uncertainty in SMILES length; and (3) it approximates the posterior distribution via multi-model ensemble to enhance sampling confidence and structural diversity. Experiments demonstrate state-of-the-art top-1 accuracy under the template-free setting, with highly competitive top-3/5/10 accuracy. Ablation studies confirm that length prediction accuracy significantly influences overall performance.
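As a rough illustration of how a categorical diffusion process can corrupt every position of a SMILES sequence independently, so that generation does not need to proceed left to right, consider the minimal sketch below. The uniform resampling kernel, the linear schedule, the toy vocabulary, and the names `VOCAB`, `keep_probability`, and `corrupt` are all assumptions made for illustration; this summary does not specify the noise process DiffER actually uses.

```python
import random

# Toy SMILES vocabulary (an assumption for illustration only).
VOCAB = ["C", "N", "O", "(", ")", "=", "1", "[PAD]"]


def keep_probability(t: int, num_steps: int) -> float:
    """Chance that a token is still uncorrupted at step t (assumed linear schedule)."""
    return 1.0 - t / num_steps


def corrupt(tokens, t, num_steps, rng):
    """Sample x_t given x_0: keep each token with probability alpha_bar,
    otherwise replace it with a token drawn uniformly from VOCAB.
    Every position is handled independently of its neighbours."""
    alpha_bar = keep_probability(t, num_steps)
    return [tok if rng.random() < alpha_bar else rng.choice(VOCAB) for tok in tokens]


x0 = list("CC(=O)O")  # acetic acid, one character per token
rng = random.Random(0)
for t in (0, 50, 100):
    print(t, "".join(corrupt(x0, t, 100, rng)))
```

At `t = 0` the sequence is returned unchanged, and at the final step every token has been resampled. A denoising model would be trained to reverse this corruption for all positions at once.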

📝 Abstract
Methods for automatic chemical retrosynthesis have found recent success through the application of models traditionally built for natural language processing, primarily through transformer neural networks. These models have demonstrated significant ability to translate between the SMILES encodings of chemical products and reactants, but are constrained as a result of their autoregressive nature. We propose DiffER, an alternative template-free method for retrosynthesis prediction in the form of categorical diffusion, which allows the entire output SMILES sequence to be predicted in unison. We construct an ensemble of diffusion models which achieves state-of-the-art performance for top-1 accuracy and competitive performance for top-3, top-5, and top-10 accuracy among template-free methods. We prove that DiffER is a strong baseline for a new class of template-free model, capable of learning a variety of synthetic techniques used in laboratory settings and outperforming a variety of other template-free methods on top-k accuracy metrics. By constructing an ensemble of categorical diffusion models with a novel length prediction component with variance, our method is able to approximately sample from the posterior distribution of reactants, producing results with strong metrics of confidence and likelihood. Furthermore, our analyses demonstrate that accurate prediction of the SMILES sequence length is key to further boosting the performance of categorical diffusion models.
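To make the top-k evaluation of an ensemble concrete, the sketch below pools sampled reactant SMILES from several models, ranks them by how often they were sampled, and checks whether the ground truth appears among the first k candidates. Ranking by sample frequency is an assumption used here as a simple proxy for confidence; the abstract does not state the exact ranking rule. The function names and the example strings are likewise hypothetical.

```python
from collections import Counter


def rank_candidates(samples_per_model):
    """Pool samples from every model and rank them, most frequent first.
    In practice the SMILES would be canonicalized before counting;
    this sketch assumes they already are."""
    counts = Counter(s for samples in samples_per_model for s in samples)
    return [smiles for smiles, _ in counts.most_common()]


def top_k_hit(ranked, ground_truth, k):
    """True if the ground-truth reactants appear among the first k candidates."""
    return ground_truth in ranked[:k]


# Hypothetical samples from three ensemble members for one product.
model_a = ["CC(=O)O.OCC", "CC(=O)Cl.OCC", "CC(=O)O.OCC"]
model_b = ["CC(=O)O.OCC", "CC(=O)Cl.OCC"]
model_c = ["CC(=O)O.OCC", "CCO"]

ranked = rank_candidates([model_a, model_b, model_c])
print(ranked)
print(top_k_hit(ranked, "CC(=O)Cl.OCC", k=1), top_k_hit(ranked, "CC(=O)Cl.OCC", k=3))
```

Averaging `top_k_hit` over a test set gives the top-k accuracy reported for values such as k = 1, 3, 5, and 10.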
Problem

Research questions and friction points this paper is trying to address.

Proposing DiffER for template-free chemical retrosynthesis prediction
Overcoming autoregressive constraints in SMILES sequence prediction
Achieving state-of-the-art accuracy with categorical diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Categorical diffusion for retrosynthesis prediction
Ensemble of diffusion models boosts accuracy
Novel length prediction enhances SMILES generation
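Because a diffusion model denoises a fixed number of positions, the output length has to be chosen before sampling starts. One way to use a length prediction that carries a variance is sketched below: draw several candidate lengths around the predicted mean and run the sampler once per candidate. The Gaussian form, the clipping bounds, and the name `sample_lengths` are assumptions for illustration, not the paper's stated design.

```python
import random


def sample_lengths(mean, std, n, max_len, rng):
    """Draw n candidate sequence lengths from an assumed Gaussian
    with the predicted mean and standard deviation, rounded to
    integers and clipped to the range [1, max_len]."""
    lengths = []
    for _ in range(n):
        length = round(rng.gauss(mean, std))
        lengths.append(min(max(length, 1), max_len))
    return lengths


rng = random.Random(0)
# Hypothetical prediction: about 42 tokens, give or take 3.
print(sample_lengths(mean=42.0, std=3.0, n=5, max_len=128, rng=rng))
```

A larger predicted variance spreads the candidates over more lengths, which reduces the risk that a single wrong length estimate makes the correct reactants unreachable.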
Sean Current
Computer Science and Engineering, The Ohio State University, Columbus, 43210, OH, USA.
Ziqi Chen
Computer Science and Engineering, The Ohio State University, Columbus, 43210, OH, USA.
Daniel Adu-Ampratwum
Research Assistant Professor, Ohio State University
Organic Chemistry, Natural Product Synthesis, Medicinal Chemistry, Drug Discovery
Xia Ning
Professor, Biomedical Informatics, Computer Science and Engineering, The Ohio State
GenAI, Medical AI, LLMs, Drug Development
Srinivasan Parthasarathy
Computer Science and Engineering, The Ohio State University, Columbus, 43210, OH, USA.