🤖 AI Summary
Existing molecular large language models are limited in their understanding of molecules by sparse textual descriptions and semantically impoverished molecular representations. To address this, we introduce KnowMol-100K, the first large-scale, multi-level chemically annotated dataset, and propose a chemistry-informed molecular representation method that explicitly integrates prior knowledge, including functional groups, reactivity patterns, and pharmacophores. Building on this, we develop an end-to-end multimodal large language model that jointly trains a molecular graph neural network and a text encoder with a cross-modal contrastive learning objective. Our approach achieves state-of-the-art performance across eight diverse tasks (including molecular property prediction, reaction prediction, retrosynthetic planning, and text-to-molecule generation), demonstrating that injecting high-quality chemical knowledge improves both molecular understanding and generation. This work establishes a new “knowledge-driven + representation-enhanced” paradigm for molecular AI.
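The summary mentions a cross-modal contrastive learning objective but does not give its form. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style loss over paired molecule and text embeddings, which is a common choice for this kind of alignment and may differ from what KnowMol actually uses. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(mol_embs, text_embs, temperature=0.07):
    """Assumed InfoNCE-style loss: pair i of molecule/text is the positive,
    every other pairing in the batch is a negative."""
    mol = [l2_normalize(v) for v in mol_embs]
    txt = [l2_normalize(v) for v in text_embs]
    n = len(mol)
    # logits[i][j]: cosine similarity of molecule i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(mol[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Molecule-to-text direction: row i should assign most mass to column i
    mol_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-molecule direction: column j should assign most mass to row j
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_mol = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (mol_to_text + text_to_mol)


# Toy embeddings (hypothetical): two molecules and their two descriptions
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                             [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, the loss for correctly paired embeddings is much smaller than the loss for swapped pairs, which is the behavior such an objective is meant to reward. In a real model the embeddings would come from the molecular graph encoder and the text encoder, and the loss would be computed with an autodiff framework rather than plain Python lists.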
📝 Abstract
Molecular large language models have garnered widespread attention due to their promising potential in molecular applications. However, current molecular large language models face significant limitations in understanding molecules, owing to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose a chemically informative molecular representation that addresses limitations of existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multimodal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks.
GitHub: https://github.com/yzf-code/KnowMol
Hugging Face: https://hf.co/datasets/yzf1102/KnowMol-100K