ChatMol: A Versatile Molecule Designer Based on the Numerically Enhanced Large Language Model

📅 2025-02-27
🤖 AI Summary
Target-guided de novo molecular design (generating high-quality molecules under single-/multi-property and substructure constraints) remains challenging. Method: This paper introduces the first numerically enhanced large language model (LLM) framework tailored for molecular design, integrating: (1) LLM-adapted molecular representations (SMILES/SELFIES); (2) constraint-oriented multi-task prompting; and (3) property-prediction-guided fine-tuning with numerical positional encoding. Crucially, it avoids dedicated property predictors and reinforcement learning. Contribution/Results: On the ESR1 multi-objective binding affinity optimization task, the method achieves K_D = 0.25, surpassing the previous state of the art by 4.76%. With numerical enhancement, the Pearson correlation between instructed property values and those of the generated molecules improves by up to 0.49, demonstrating precise controllability. The framework advances controllable, interpretable, and efficient molecular generation.

📝 Abstract
Goal-oriented de novo molecule design, namely generating molecules with specific property or substructure constraints, is a crucial yet challenging task in drug discovery. Existing methods, such as Bayesian optimization and reinforcement learning, often require training multiple property predictors and struggle to incorporate substructure constraints. Inspired by the success of Large Language Models (LLMs) in text generation, we propose ChatMol, a novel approach that leverages LLMs for molecule design across diverse constraint settings. Initially, we crafted a molecule representation compatible with LLMs and validated its efficacy across multiple online LLMs. We then developed task-specific prompts for diverse constrained molecule generation tasks to fine-tune current LLMs, integrating feedback learning derived from property prediction. Finally, to address the limitations of LLMs in numerical recognition, we drew on positional encoding methods and incorporated additional encoding for numerical values within the prompt. Experimental results across single-property, substructure-property, and multi-property constrained tasks demonstrate that ChatMol consistently outperforms state-of-the-art baselines, including VAE- and RL-based methods. Notably, in the multi-objective binding affinity maximization task, ChatMol achieves a significantly lower KD value of 0.25 for the protein target ESR1 while maintaining the highest overall performance, surpassing previous methods by 4.76%. Meanwhile, with numerical enhancement, the Pearson correlation coefficient between the instructed property values and those of the generated molecules increased by up to 0.49. These findings highlight the potential of LLMs as a versatile framework for molecule generation, offering a promising alternative to traditional latent-space and RL-based approaches.
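The abstract does not spell out how the numerical encoding works; a minimal sketch of one plausible scheme, applying transformer-style sinusoidal positional encoding to a scalar property value rather than a token position (the function name `encode_number` and the dimension are hypothetical, not from the paper):

```python
import math

def encode_number(value: float, dim: int = 8) -> list[float]:
    """Sinusoidal encoding of a scalar property value (e.g. a logP or
    K_D target), analogous to transformer positional encoding: each
    pair of dimensions uses a different frequency."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))  # geometric frequency schedule
        enc.append(math.sin(value * freq))
        enc.append(math.cos(value * freq))
    return enc

# A target value such as K_D = 0.25 becomes a fixed-length vector that
# can be added to the embeddings of the number's tokens in the prompt.
vec = encode_number(0.25)
```

Such an encoding gives the model a smooth, magnitude-aware signal for numbers, which plain subword tokenization of digits lacks.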
Problem

Research questions and friction points this paper is trying to address.

Molecule design with specific property constraints
Incorporating substructure constraints in molecule generation
Enhancing numerical recognition in Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based molecule design
Numerical value encoding
Task-specific prompt fine-tuning
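The exact prompt templates are not given on this page; a hedged illustration of how a constraint-oriented prompt for fine-tuning might be assembled (the `build_prompt` helper and its wording are hypothetical):

```python
def build_prompt(task: str, constraints: dict) -> str:
    """Serialize a generation task and its property/substructure
    constraints into a single instruction string for an LLM."""
    parts = [f"{name}: {value}" for name, value in constraints.items()]
    return f"Task: {task}. Constraints: " + "; ".join(parts) + ". Output SMILES:"

# e.g. a single-property constraint plus a benzene-ring substructure
prompt = build_prompt(
    "generate a molecule",
    {"logP": 2.5, "substructure": "c1ccccc1"},
)
```

Framing every constraint type (property targets, substructures, multi-objective combinations) in one shared template is what lets a single fine-tuned model cover the diverse task settings the paper describes.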
Chuanliu Fan
Ziqiang Cao
Soochow University
Natural Language Processing
Zicheng Ma
Peking University
Biophysics · Bioinformatics · Deep learning
Nan Yu
Yimin Peng
Jun Zhang
Yiqin Gao
Guohong Fu