How to Make Large Language Models Generate 100% Valid Molecules?

📅 2025-09-27

🤖 AI Summary
Large language models (LLMs) perform poorly at molecular generation in few-shot settings and frequently produce syntactically invalid SMILES strings. The paper proposes SmiSelf, a cross-chemical-language framework that uses formal grammar rules to convert invalid SMILES into SELFIES, a representation in which every string corresponds to a valid molecule, while preserving the semantics of the original molecule. SmiSelf is presented as the first approach to reach 100% syntactically valid and chemically reasonable molecules with LLM-based generation. In the reported experiments it preserves key physicochemical properties of the original molecules (e.g., logP, synthetic accessibility, QED) and matches or exceeds baseline models in diversity, novelty, and drug-likeness. Because it operates on generated SMILES strings, SmiSelf can be added to existing SMILES-based generative models without retraining, which makes it a practical correction step for LLM-driven molecular design in drug discovery and materials science.

📝 Abstract
Molecule generation is key to drug discovery and materials science, enabling the design of novel compounds with specific properties. Large language models (LLMs) can learn to perform a wide range of tasks from just a few examples. However, generating valid molecules using representations like SMILES is challenging for LLMs in few-shot settings. In this work, we explore how LLMs can generate 100% valid molecules. We evaluate whether LLMs can use SELFIES, a representation where every string corresponds to a valid molecule, for valid molecule generation but find that LLMs perform worse with SELFIES than with SMILES. We then examine LLMs' ability to correct invalid SMILES and find their capacity limited. Finally, we introduce SmiSelf, a cross-chemical language framework for invalid SMILES correction. SmiSelf converts invalid SMILES to SELFIES using grammatical rules, leveraging SELFIES' mechanisms to correct the invalid SMILES. Experiments show that SmiSelf ensures 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics. SmiSelf helps expand LLMs' practical applications in biomedicine and is compatible with all SMILES-based generative models. Code is available at https://github.com/wentao228/SmiSelf.
Problem

Research questions and friction points this paper is trying to address.

Ensuring LLMs generate 100% valid molecular structures
Addressing invalid SMILES generation in few-shot learning settings
Correcting invalid molecules while preserving key chemical properties
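
To make "validity" concrete, here is a minimal sketch of how a validity rate over generated SMILES strings could be computed. It is an illustration, not the paper's evaluation code: the function names `has_balanced_syntax` and `validity_rate` are hypothetical, and the check is only a rough syntactic proxy (balanced branches, paired ring-closure digits, closed bracket atoms). It does not verify valence or aromaticity, and it ignores two-digit `%nn` ring closures.

```python
def has_balanced_syntax(smiles: str) -> bool:
    """Rough syntactic proxy for SMILES validity (assumption: not a full parser)."""
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # True while inside a bracket atom such as [NH4+]
    for ch in smiles:
        if in_bracket:
            # Digits inside bracket atoms are isotopes or counts, not ring bonds.
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a branch was closed before it was opened
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first occurrence opens, second closes
    return bool(smiles) and depth == 0 and not open_rings and not in_bracket


def validity_rate(samples) -> float:
    """Fraction of samples that pass the syntactic check."""
    if not samples:
        return 0.0
    return sum(has_balanced_syntax(s) for s in samples) / len(samples)


if __name__ == "__main__":
    generated = ["CCO", "C1CC", "CC(C)C", "C)("]
    print(validity_rate(generated))  # 0.5
```

In practice, validity is usually checked with a cheminformatics toolkit; in RDKit, for example, `Chem.MolFromSmiles` returns `None` when a string cannot be parsed into a molecule. The stdlib-only check above is meant only to show the kinds of errors (unclosed rings, unmatched branches) that make a generated string invalid.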
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages SELFIES, a representation in which every string corresponds to a valid molecule, to guarantee validity
Introduces SmiSelf framework for invalid SMILES correction
Converts invalid SMILES to SELFIES using grammatical rules
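
The idea that a representation can guarantee validity by construction can be sketched with a toy decoder. The code below is not the SELFIES grammar and not the SmiSelf algorithm; it is a simplified stand-in, with a hypothetical function `decode_chain` and an assumed valence table, that shows one constraint of this kind: each requested bond order is capped by the valence still available, so the decoded linear chain never exceeds an atom's valence.

```python
# Toy illustration only: NOT the real SELFIES grammar or the SmiSelf method.
# Assumed maximum valences for a handful of elements.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}


def decode_chain(tokens):
    """Decode (bond_order, element) tokens into a linear SMILES chain.

    The bond order of the first token is ignored because there is no
    previous atom. For every later token, the requested order is capped by
    the free valence of the previous atom and by the valence of the new
    atom. Decoding stops once the previous atom has no free valence.
    """
    parts = []
    free = None  # valence still free on the most recent atom
    for order, element in tokens:
        cap = MAX_VALENCE[element]
        if free is None:          # first atom starts the chain
            parts.append(element)
            free = cap
            continue
        if free == 0:             # nothing left to bond to: end the chain
            break
        order = min(order, free, cap)
        parts.append(BOND_SYMBOL[order] + element)
        free = cap - order        # valence left on the newly added atom
    return "".join(parts)


if __name__ == "__main__":
    # A triple bond to oxygen is requested but capped to a double bond,
    # after which oxygen has no valence left and the chain ends.
    print(decode_chain([(1, "C"), (3, "O"), (1, "C")]))  # C=O
    print(decode_chain([(1, "N"), (3, "C"), (2, "O")]))  # N#CO
```

Because the cap is applied at decoding time, any token sequence over the table yields a chain that respects the assumed valences, which mirrors the property the abstract attributes to SELFIES: every string corresponds to a valid molecule.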