Enhancing Molecular Property Prediction with Knowledge from Large Language Models

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Molecular property prediction (MPP) suffers from insufficient integration of human prior knowledge, while large language models (LLMs) used for knowledge extraction are constrained by knowledge gaps and hallucinations, particularly for sparsely studied properties. Method: We propose a knowledge-enhanced multimodal fusion framework that, for the first time, jointly leverages domain-specific knowledge and executable Python code generated by LLMs (GPT-4o, GPT-4.1, DeepSeek-R1) to construct semantically rich molecular representations, which are then aligned and fused, via cross-modal alignment, with structural representations extracted by graph neural networks. The approach requires no LLM fine-tuning, preserving both interpretability and computational efficiency. Contribution/Results: Evaluated on multiple standard MPP benchmarks, the method significantly outperforms existing state-of-the-art approaches, demonstrating that knowledge-guided representation learning improves generalization, robustness, and adaptability in few-shot settings.

📝 Abstract
Predicting molecular properties is a critical component of drug discovery. Recent advances in deep learning, particularly Graph Neural Networks (GNNs), have enabled end-to-end learning from molecular structures, reducing reliance on manual feature engineering. However, while GNNs and self-supervised learning approaches have advanced molecular property prediction (MPP), the integration of human prior knowledge remains indispensable, as evidenced by recent methods that leverage large language models (LLMs) for knowledge extraction. Despite their strengths, LLMs are constrained by knowledge gaps and hallucinations, particularly for less-studied molecular properties. In this work, we propose a novel framework that, for the first time, integrates knowledge extracted from LLMs with structural features derived from pre-trained molecular models to enhance MPP. Our approach prompts LLMs to generate both domain-relevant knowledge and executable code for molecular vectorization, producing knowledge-based features that are subsequently fused with structural representations. We employ three state-of-the-art LLMs, GPT-4o, GPT-4.1, and DeepSeek-R1, for knowledge extraction. Extensive experiments demonstrate that our integrated method outperforms existing approaches, confirming that the combination of LLM-derived knowledge and structural information provides a robust and effective solution for MPP.
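As a rough illustration of the vectorization step described in the abstract (prompting an LLM for executable featurization code and running it), here is a minimal, hypothetical sketch. The mocked `featurize` function and its toy SMILES-token descriptors are illustrative stand-ins, not the paper's actual prompts or features:

```python
# Hypothetical sketch: executing LLM-generated vectorization code.
# Here the LLM's reply is mocked with a hand-written featurizer; a real
# system would call the LLM API and validate the returned code first.
llm_generated_code = '''
def featurize(smiles):
    # Toy knowledge-based descriptors: counts of a few SMILES tokens
    # (an LLM would typically emit richer, property-specific descriptors).
    return [
        smiles.count("C"),  # aliphatic carbons
        smiles.count("O"),  # oxygens
        smiles.count("N"),  # nitrogens
        smiles.count("="),  # double bonds
        smiles.count("c"),  # aromatic carbons
    ]
'''

namespace = {}
exec(llm_generated_code, namespace)   # in practice: sandbox + validation
featurize = namespace["featurize"]

knowledge_vector = featurize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin SMILES
print(knowledge_vector)  # [3, 4, 0, 2, 6]
```

The resulting knowledge-based vector would then be fused with a structural embedding from a pre-trained molecular model.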
Problem

Research questions and friction points this paper is trying to address.

Integrating LLM knowledge with molecular structures for property prediction
Addressing knowledge gaps and hallucinations in LLMs for MPP
Fusing LLM-generated features with pre-trained molecular representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates LLM knowledge with pre-trained molecular models
Generates domain knowledge and code for vectorization
Fuses knowledge-based features with structural representations
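The fusion step in the last bullet can be sketched as a simple late fusion with a cross-modal alignment signal. All dimensions, random weights, and the cosine-based alignment below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, W, b):
    # Linear projection into a shared embedding space.
    return x @ W + b

# Hypothetical dimensions: 64-d knowledge features (from LLM-generated
# vectorization code) and 128-d structural features (from a GNN encoder).
d_know, d_struct, d_shared = 64, 128, 32

W_k = rng.standard_normal((d_know, d_shared)) * 0.1
b_k = np.zeros(d_shared)
W_s = rng.standard_normal((d_struct, d_shared)) * 0.1
b_s = np.zeros(d_shared)

knowledge_vec = rng.standard_normal(d_know)    # stand-in for LLM-derived features
structure_vec = rng.standard_normal(d_struct)  # stand-in for GNN embedding

z_k = project(knowledge_vec, W_k, b_k)
z_s = project(structure_vec, W_s, b_s)

# Cross-modal alignment signal: cosine similarity between the two views;
# a training loss would push this toward 1 for matched molecules.
cos = float(z_k @ z_s / (np.linalg.norm(z_k) * np.linalg.norm(z_s)))

# Late fusion: concatenate the aligned views for a downstream property head.
fused = np.concatenate([z_k, z_s])
print(fused.shape)  # (64,)
```

In a trained system the projection weights would be learned jointly with the alignment and prediction objectives rather than sampled at random.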
Peng Zhou
Hunan University
Lai Hou Tim
Tencent AI for Life Science Lab
Zhixiang Cheng
Hunan University
Kun Xie
Tencent AI for Life Science Lab
Chaoyi Li
College of Computer Science and Engineering, Hunan University
AI4Science, Computer Vision, Drug Discovery
Wei Liu
Tencent AI for Life Science Lab
Xiangxiang Zeng
Department of Computer Science, Hunan University
Computational Intelligence, AI4Science, AI for Drug Discovery