KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge

📅 2025-10-22
🤖 AI Summary
Existing molecular large language models suffer from sparse textual descriptions and semantically impoverished molecular representations, limiting their molecular understanding capability. To address this, we introduce KnowMol-100K—the first large-scale, multi-level chemically annotated dataset—and propose a chemistry-informed molecular representation method that explicitly integrates prior knowledge including functional groups, reactivity patterns, and pharmacophores. Building upon this, we develop an end-to-end multimodal large language model that jointly optimizes a molecular graph neural network, a text encoder, and a cross-modal contrastive learning objective. Our approach achieves state-of-the-art performance across eight diverse tasks—including molecular property prediction, reaction prediction, retrosynthetic planning, and text-to-molecule generation—demonstrating that high-quality chemical knowledge injection fundamentally enhances both molecular understanding and generation. This work establishes a new “knowledge-driven + representation-enhanced” paradigm for molecular AI.
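
As an illustration of what a cross-modal contrastive learning objective can look like, here is a minimal pure-Python sketch of a symmetric InfoNCE loss over paired molecule and text embeddings. The summary does not say which loss KnowMol uses, so the function name `symmetric_info_nce`, the temperature value, and the toy embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_info_nce(mol_embs, text_embs, temperature=0.07):
    """Hypothetical helper: symmetric InfoNCE over a batch of pairs.

    mol_embs[i] and text_embs[i] are assumed to describe the same molecule;
    every other pairing in the batch is treated as a negative.
    """
    if not mol_embs or len(mol_embs) != len(text_embs):
        raise ValueError("expected two non-empty batches of equal size")

    mol = [_normalize(v) for v in mol_embs]
    txt = [_normalize(v) for v in text_embs]
    n = len(mol)

    # Cosine similarity between every molecule and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(mol[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Molecule-to-text direction: each row should pick out its own column.
    loss_m2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-molecule direction: each column should pick out its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2m = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_m2t + loss_t2m)


if __name__ == "__main__":
    aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

In this sketch the loss is small when each molecule embedding is closest to its own description and large when the pairs are swapped. In a real model the embeddings would come from the molecular graph encoder and the text encoder rather than from hand-written lists.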

📝 Abstract
Molecular large language models have garnered widespread attention for their potential in molecular applications. However, current models are limited in their understanding of molecules by inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose a chemically-informative molecular representation that addresses limitations of existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks.
GitHub: https://github.com/yzf-code/KnowMol
Hugging Face: https://hf.co/datasets/yzf1102/KnowMol-100K

Problem

Research questions and friction points this paper is trying to address.

Addresses inadequate textual descriptions in molecular language models
Improves suboptimal molecular representation strategies during pretraining
Bridges the gap between molecules and textual descriptions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level molecular annotations bridge text-molecule gap
Chemically-informative molecular representation addresses limitations of existing representation strategies
Multi-modal molecular large language model achieves superior performance
Authors
Zaifei Yang
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences (CAS), China
Hong Chang
Researcher at Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning, Computer Vision, Pattern Recognition
Ruibing Hou
Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Deep Learning
Shiguang Shan
Professor of Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning, Face Recognition
Xilin Chen
State Key Laboratory of AI Safety, Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences (CAS), China