ConRetroBert: EMA Stabilized Dual Encoders for Template-Based Single-Step Retrosynthesis

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of template-based single-step retrosynthesis: it is often seen as less competitive than template-free approaches because template prediction is usually modeled as global classification over a long-tailed reaction rule library. The authors propose ConRetroBert, a dual-encoder framework that reformulates template prediction as two stages: dense product-to-template retrieval, followed by listwise ranking over a candidate set. Stage 1 uses contrastive pretraining to build a shared embedding space for products and reaction templates. Stage 2 refines template ordering over mined hard-negative candidate sets with a multi-positive listwise objective. A slow-moving exponential moving average (EMA) copy of the template encoder builds the retrieval bank, while the live template encoder is updated through the ranking loss, so the template side can adapt without destabilizing hard-negative mining. On USPTO-50k, Stage 2 ranking raises top-1 accuracy from 50.5% to 61.3%, and EMA-stabilized template adaptation raises it further to 62.4%. Fine-tuning from a leakage-controlled USPTO-Full checkpoint reaches 75.4% top-1 accuracy on USPTO-50k. Retrieval-based template prediction is also reported to be strong in the long tail of rare templates.
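The EMA mechanism described above can be illustrated with a minimal sketch. It assumes the standard update rule ema = decay * ema + (1 - decay) * live applied to a flat list of parameters. The function name `ema_update` and the decay value are hypothetical and chosen for illustration; the summary does not state the decay used by the authors.

```python
def ema_update(ema_params, live_params, decay=0.99):
    """Return updated EMA parameters: decay * ema + (1 - decay) * live.

    `decay` close to 1.0 makes the EMA copy move slowly, which is the
    property that keeps a retrieval bank stable between refreshes.
    The value here is an illustrative assumption, not the paper's setting.
    """
    if len(ema_params) != len(live_params):
        raise ValueError("parameter lists must have the same length")
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, live_params)]


# Toy example: the live encoder weights sit far from the EMA copy,
# and the EMA copy drifts toward them a little on every step.
ema = [0.0, 1.0]
live = [1.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, live, decay=0.9)
```

In this toy run the EMA copy has covered only part of the gap to the live weights after three steps, which is the intended lag: hard negatives mined from the EMA side change gradually even when the live encoder changes quickly.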

📝 Abstract

Template-based single-step retrosynthesis predicts reactants by selecting and applying an explicit reaction template, making each prediction traceable to a chemical transformation rule. This is useful for synthesis planning, but template-based methods are often viewed as less competitive than template-free models because template prediction is commonly formulated as global classification over a long-tailed rule library. We argue that this weakness is not inherent to templates, but to the learning formulation. We present ConRetroBert, a dual-encoder framework that reframes template-based retrosynthesis as dense product-template retrieval followed by candidate-set listwise ranking. Stage 1 uses contrastive pretraining to learn a shared embedding space between products and reaction templates. Stage 2 refines template ranking over mined hard-negative candidate sets with a multi-positive listwise objective. To enable template-side adaptation without destabilizing hard-negative mining, ConRetroBert uses a slow-moving exponential moving average (EMA) template encoder for retrieval bank construction while updating the live template encoder through the ranking loss. On the local USPTO-50k benchmark, Stage 2 candidate-set ranking improves top-1 reaction accuracy from 50.5% to 61.3%, while EMA-stabilized template adaptation further improves it to 62.4%. Fine-tuning from a leakage-controlled USPTO-Full checkpoint reaches 75.4% top-1 accuracy on USPTO-50k. We also show that retrieval-based template prediction is strong in the long tail of rare templates, and that many correct reactant predictions arise from alternative explicit templates rather than only the recorded positive label. Code and data are available at https://github.com/JahidBasher/ConRetroBert.
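The abstract names the Stage 2 objective only as a multi-positive listwise loss and does not give its exact form. One common formulation, assumed here purely for illustration, is the negative log of the softmax probability mass that the candidate set assigns to all positive templates. The function names and the example scores below are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def multi_positive_listwise_loss(scores, positive_mask):
    """Negative log of the softmax mass placed on positive candidates.

    scores:        similarity score of each candidate template
    positive_mask: True where the candidate counts as a correct template
    This is an assumed formulation, not necessarily the paper's loss.
    """
    if len(scores) != len(positive_mask):
        raise ValueError("scores and positive_mask must have the same length")
    positives = [s for s, is_pos in zip(scores, positive_mask) if is_pos]
    if not positives:
        raise ValueError("candidate set needs at least one positive")
    return logsumexp(scores) - logsumexp(positives)


# Four candidates with equal scores, two of them positive:
# half of the softmax mass sits on positives.
loss = multi_positive_listwise_loss([0.0, 0.0, 0.0, 0.0],
                                    [True, True, False, False])
```

Under this assumed form, the loss reaches zero only when all softmax mass falls on positives, and it does not penalize the model for preferring one valid template over another. That fits the abstract's remark that correct reactants can arise from alternative explicit templates.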

Problem

Research questions and friction points this paper is trying to address.

template-based retrosynthesis

long-tailed rule library

global classification

reaction template prediction

single-step retrosynthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual encoder

contrastive pretraining

listwise ranking