🤖 AI Summary
Current molecular generative models for drug discovery are typically confined to isolated stages and fail to jointly address molecular generation, screening, and optimization, which limits both development efficiency and chemical diversity. To overcome this limitation, we propose the first general-purpose molecular generation model designed for end-to-end drug discovery, supporting fragment-constrained generation, de novo design, hit generation, and lead optimization within a unified framework. Methodologically, we introduce a non-autoregressive, bidirectional parallel decoding architecture grounded in discrete diffusion, coupled with a fragment remasking strategy that enables controllable, fragment-level optimization and efficient exploration of chemical space. By combining the SAFE molecular representation with discrete diffusion, our model significantly outperforms the GPT-based baseline on de novo and fragment-constrained generation tasks, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization.
📝 Abstract
Drug discovery is a complex process that involves multiple scenarios and stages, such as fragment-constrained molecule generation, hit generation, and lead optimization. However, existing molecular generative models can tackle only one or two of these scenarios and lack the flexibility to address the various aspects of the drug discovery pipeline. In this paper, we present the Generalist Molecular generative model (GenMol), a versatile framework that addresses these limitations by applying discrete diffusion to the Sequential Attachment-based Fragment Embedding (SAFE) molecular representation. GenMol generates SAFE sequences through non-autoregressive bidirectional parallel decoding, which lets it use molecular context that does not depend on a specific token ordering and improves computational efficiency. Moreover, under the discrete diffusion framework, we introduce fragment remasking, a strategy that optimizes molecules by replacing fragments with masked tokens and regenerating them, enabling effective exploration of chemical space. GenMol significantly outperforms the previous GPT-based model trained on SAFE representations in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These experimental results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach to molecular design.
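To make the two mechanisms named in the abstract more concrete, the following is a minimal Python sketch of fragment remasking followed by parallel unmasking. It is an illustration under stated assumptions, not GenMol's implementation: `toy_denoiser` is a hypothetical random stand-in for a trained bidirectional model, the vocabulary and token lists are toy examples rather than valid SAFE strings, the regenerated fragment is kept at its original length for simplicity, and the confidence-ranked commit schedule is one common choice in masked discrete diffusion that the abstract does not confirm.

```python
import random

MASK = "[MASK]"
# Toy vocabulary for illustration only; not the real SAFE vocabulary.
VOCAB = ["C", "N", "O", "c", "1", "(", ")", "="]


def toy_denoiser(tokens, rng):
    """Hypothetical stand-in for a trained bidirectional denoiser.

    Returns {position: (token, confidence)} for every masked position.
    A real model would condition on the unmasked tokens on both sides.
    """
    return {
        i: (rng.choice(VOCAB), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def parallel_unmask(tokens, rng, steps=4):
    """Fill masked positions over a few steps, several tokens per step."""
    tokens = list(tokens)
    for step in range(steps):
        preds = toy_denoiser(tokens, rng)
        if not preds:
            break
        # Commit an equal share of the remaining masks at each step,
        # which means all remaining masks are filled on the final step.
        remaining_steps = steps - step
        n_commit = -(-len(preds) // remaining_steps)  # ceiling division
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


def fragment_remasking_step(fragments, rng):
    """Mask one randomly chosen fragment and regenerate it in context."""
    idx = rng.randrange(len(fragments))
    masked = [list(f) for f in fragments]
    # Assumption: the new fragment has the same length as the old one.
    masked[idx] = [MASK] * len(fragments[idx])

    # Flatten with "." separators so the denoiser sees the whole molecule.
    flat, spans = [], []
    for frag in masked:
        start = len(flat)
        flat.extend(frag)
        spans.append((start, len(flat)))
        flat.append(".")
    flat.pop()  # drop the trailing separator

    filled = parallel_unmask(flat, rng)
    return [filled[s:e] for s, e in spans], idx


if __name__ == "__main__":
    rng = random.Random(0)
    fragments = [list("c1ccccc1"), list("CC(=O)"), list("N")]
    new_fragments, changed = fragment_remasking_step(fragments, rng)
    print("regenerated fragment index:", changed)
    print(".".join("".join(f) for f in new_fragments))
```

Because the denoiser here is random, the regenerated fragment is not chemically meaningful; the sketch only shows the control flow. In a goal-directed setting, a step like this would presumably be repeated and the candidates scored against the target property, but the abstract does not specify that procedure.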