🤖 AI Summary
Existing methods for generating molecular analogs struggle to combine controllable local edits with structural diversity, so they fall short of the design strategies medicinal chemists use in practice. To close this gap, the authors propose a variable-to-variable generative framework: a foundation model trained on large-scale matched molecular pair transformations (MMPTs), extended with a retrieval-augmented variant (MMPT-RAG). Prompting mechanisms let users specify preferred transformation patterns, and retrieved reference analogs serve as contextual guidance during generation. On both general chemical corpora and patent-specific datasets, the method improves the diversity, novelty, and controllability of generated analogs and recovers realistic analog structures in practical discovery scenarios.
📝 Abstract
Matched molecular pairs (MMPs) capture the local chemical edits that medicinal chemists routinely use to design analogs, but existing ML approaches either operate at the whole-molecule level with limited edit controllability or learn MMP-style edits from restricted settings and small models. We propose a variable-to-variable formulation of analog generation and train a foundation model on large-scale MMP transformations (MMPTs) to generate diverse variables conditioned on an input variable. To enable practical control, we develop prompting mechanisms that let users specify preferred transformation patterns during generation. We further introduce MMPT-RAG, a retrieval-augmented framework that uses external reference analogs as contextual guidance to steer generation and generalize from project-specific series. Experiments on general chemical corpora and patent-specific datasets demonstrate improved diversity, novelty, and controllability, and show that our method recovers realistic analog structures in practical discovery scenarios.
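To make the variable-to-variable idea and the retrieval step more concrete, the following is a minimal sketch, not the authors' implementation. It assumes an MMP transformation can be written as a pair of fragment strings (source variable, target variable) and that retrieved reference transformations are prepended to the query as in-context examples. The `MMPT` class, the `retrieve` and `build_prompt` helpers, the `>>` and `pattern:` prompt syntax, and the string-similarity ranking are all hypothetical choices made for illustration; the abstract does not specify a prompt format or a retriever.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass(frozen=True)
class MMPT:
    """Hypothetical record for one matched-molecular-pair transformation."""
    source_var: str  # variable fragment before the edit, e.g. "[*:1]Cl"
    target_var: str  # variable fragment after the edit, e.g. "[*:1]F"


def similarity(a: str, b: str) -> float:
    # Plain string similarity as a stand-in; a real system would more
    # likely compare chemical fingerprints or learned embeddings.
    return SequenceMatcher(None, a, b).ratio()


def retrieve(query_var: str, references: List[MMPT], k: int = 2) -> List[MMPT]:
    """Return the k reference transformations whose source is closest to the query."""
    ranked = sorted(
        references,
        key=lambda ref: similarity(query_var, ref.source_var),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_var: str, retrieved: List[MMPT],
                 pattern: Optional[str] = None) -> str:
    """Assemble an assumed text prompt: examples, optional pattern hint, then the query."""
    lines = [f"{ref.source_var}>>{ref.target_var}" for ref in retrieved]
    if pattern is not None:
        lines.append(f"pattern:{pattern}")  # user-specified transformation preference
    lines.append(f"{query_var}>>")  # the model would complete the target variable
    return "\n".join(lines)


# Project-specific reference series (illustrative fragments only).
references = [
    MMPT("[*:1]C", "[*:1]CC"),
    MMPT("[*:1]Cl", "[*:1]F"),
    MMPT("[*:1]c1ccccc1", "[*:1]c1ccncc1"),
]

query = "[*:1]Cl"
prompt = build_prompt(query, retrieve(query, references, k=1), pattern="halogen swap")
print(prompt)
```

In this sketch the constant part of the molecule never enters the prompt: only the variable fragment is conditioned on and generated, which is one way to read the variable-to-variable formulation. The generated target variable would then be reattached to the unchanged constant part to form the analog.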