🤖 AI Summary
Molecular discovery remains constrained by the limited capacity of traditional computational methods to integrate heterogeneous data modalities and domain-specific constraints. Method: This work systematically investigates how large language models (LLMs) can transform molecular discovery, focusing on two core tasks: molecular generation conditioned on text and symbolic notations (SMILES/SELFIES), and multi-modal molecular optimization. We propose the first taxonomy of LLM-based molecular discovery tasks, unifying autoregressive generation, instruction tuning, reinforcement learning–based optimization, symbolic constraint decoding, and multi-modal alignment. An extensible evaluation framework is built upon standard datasets (e.g., ChEMBL, ZINC), complemented by a structured knowledge graph and an open-source resource repository that supports reproducibility and continuous updates. Contribution/Results: The study establishes a foundational theoretical paradigm, technical pipeline, and benchmark suite for integrating LLMs with computational chemistry, enabling rigorous, scalable, and interpretable molecular design.
📝 Abstract
Large language models (LLMs) are introducing a paradigm shift in molecular discovery by enabling text-guided interaction with chemical spaces through natural language and symbolic notations, with emerging extensions that incorporate multi-modal inputs. To advance the new field of LLMs for molecular discovery, this survey provides an up-to-date and forward-looking review of the emerging use of LLMs for two central tasks: molecule generation and molecule optimization. Based on our proposed taxonomy for both problems, we analyze representative techniques in each category, highlighting how LLM capabilities are leveraged across different learning settings. In addition, we summarize commonly used datasets and evaluation protocols. We conclude by discussing key challenges and future directions, positioning this survey as a resource for researchers working at the intersection of LLMs and molecular science. A continuously updated reading list is available at https://github.com/REAL-Lab-NU/Awesome-LLM-Centric-Molecular-Discovery.