KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge

📅 2025-10-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing molecular large language models suffer from sparse textual descriptions and semantically impoverished molecular representations, which limits their ability to understand molecules. To address this, we introduce KnowMol-100K, the first large-scale, multi-level chemically annotated dataset, and propose a chemistry-informed molecular representation method that explicitly integrates prior knowledge, including functional groups, reactivity patterns, and pharmacophores. Building on this, we develop an end-to-end multimodal large language model that jointly trains a molecular graph neural network and a text encoder with a cross-modal contrastive learning objective. Our approach achieves state-of-the-art performance across eight diverse tasks, including molecular property prediction, reaction prediction, retrosynthetic planning, and text-to-molecule generation, indicating that injecting high-quality chemical knowledge enhances both molecular understanding and generation. This work proposes a “knowledge-driven + representation-enhanced” paradigm for molecular AI.
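The summary above mentions a cross-modal contrastive learning objective without giving its form. As a rough illustration only, the sketch below shows a symmetric InfoNCE-style loss, one common way such objectives are written. It is not taken from the KnowMol code; the function names, the temperature value, and the toy embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(mol_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE loss; pair i is (mol_embs[i], text_embs[i]).

    Hypothetical helper for illustration, not the authors' objective.
    """
    n = len(mol_embs)
    # Temperature-scaled similarity of every molecule to every text.
    sims = [[cosine(m, t) / temperature for t in text_embs] for m in mol_embs]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], with the matching pair on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            top = max(row)  # subtract the max for numerical stability
            log_z = top + math.log(sum(math.exp(s - top) for s in row))
            total += log_z - row[i]
        return total / n

    mol_to_text = cross_entropy(sims)
    text_to_mol = cross_entropy([list(col) for col in zip(*sims)])
    return 0.5 * (mol_to_text + text_to_mol)


# Toy embeddings (made up): two molecules and two descriptions.
mol = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # description i matches molecule i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # descriptions paired with the wrong molecule

loss_aligned = info_nce(mol, aligned)
loss_swapped = info_nce(mol, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each molecule embedding is closest to its own description and large when the pairing is wrong, which is the behavior a contrastive objective is meant to reward.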

📝 Abstract

Molecular large language models have attracted widespread attention for their potential in molecular applications. However, current models are limited in their understanding of molecules by inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset of 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. We also propose a chemically informative molecular representation that addresses limitations of existing molecular representation strategies. Building on these contributions, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks.

GitHub: https://github.com/yzf-code/KnowMol

Hugging Face: https://hf.co/datasets/yzf1102/KnowMol-100K

Problem

Research questions and friction points this paper is trying to address.

Addresses inadequate textual descriptions in molecular language models

Replaces suboptimal molecular representation strategies used during pretraining

Bridges the gap between molecules and textual descriptions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-level molecular annotations bridge text-molecule gap

Chemically informative molecular representation addresses limitations of existing representation strategies

Multi-modal molecular large language model (KnowMol) achieves superior performance on molecular understanding and generation tasks
