Generalist Large Language Models for Molecular Property Prediction: Distilling Knowledge from Specialist Models

📅 2026-03-12
🤖 AI Summary
This work addresses the limited performance of large language models (LLMs) in molecular property prediction tasks, which hinders their practical utility in drug discovery. To bridge this gap, the authors propose TreeKD, a novel framework that, for the first time, translates interpretable rules learned by functional-group-based tree ensemble models—such as decision trees and random forests—into natural language. These rule-derived explanations are then integrated into LLMs via contextual injection and a test-time rule-consistency ensembling strategy, effectively enabling knowledge distillation and rule-augmented inference. Evaluated on 22 ADMET property prediction tasks from the Therapeutics Data Commons (TDC) benchmark, TreeKD substantially enhances LLM performance and significantly narrows the accuracy gap between LLMs and state-of-the-art expert models.
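
The rule-to-text step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Leaf` and `Split` classes, the feature names (`hydroxyl`, `aromatic ring`), the labels, and the sentence template are all hypothetical, chosen only to show how a decision path over functional-group counts might be rendered as a sentence that can be placed in an LLM prompt.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class predicted at this leaf (hypothetical label)


@dataclass
class Split:
    feature: str    # functional-group name (hypothetical feature)
    threshold: int  # go left if count <= threshold, otherwise right
    left: "Node"
    right: "Node"


Node = Union[Leaf, Split]


def verbalize_path(tree: Node, counts: Dict[str, int]) -> str:
    """Follow the path one molecule takes through the tree and describe it in English."""
    conditions = []
    node = tree
    while isinstance(node, Split):
        n = counts.get(node.feature, 0)  # a missing group counts as zero
        if n <= node.threshold:
            conditions.append(f"at most {node.threshold} {node.feature} group(s)")
            node = node.left
        else:
            conditions.append(f"more than {node.threshold} {node.feature} group(s)")
            node = node.right
    if not conditions:
        return f"The predicted label is {node.label}."
    joined = " and ".join(conditions)
    return f"If the molecule has {joined}, then the predicted label is {node.label}."


# Toy tree; the structure and labels are made up for illustration only.
toy_tree = Split(
    "hydroxyl", 0,
    Split("aromatic ring", 1, Leaf("negative"), Leaf("positive")),
    Leaf("negative"),
)

rule = verbalize_path(toy_tree, {"hydroxyl": 0, "aromatic ring": 2})
print(rule)
```

In practice the tree would come from a trained model and the counts from a cheminformatics toolkit; both are replaced here by hand-written stand-ins so the sketch stays self-contained.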

📝 Abstract
Molecular Property Prediction (MPP) is a central task in drug discovery. While Large Language Models (LLMs) show promise as generalist models for MPP, their current performance remains below the threshold for practical adoption. We propose TreeKD, a novel knowledge distillation method that transfers complementary knowledge from tree-based specialist models into LLMs. Our approach trains specialist decision trees on functional group features, then verbalizes their learned predictive rules as natural language to enable rule-augmented context learning. This enables LLMs to leverage structural insights that are difficult to extract from SMILES strings alone. We further introduce rule-consistency, a test-time scaling technique inspired by bagging that ensembles predictions across diverse rules from a Random Forest. Experiments on 22 ADMET properties from the TDC benchmark demonstrate that TreeKD substantially improves LLM performance, narrowing the gap with SOTA specialist models and advancing toward practical generalist models for molecular property prediction.
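
The rule-consistency idea can be sketched as a vote over per-rule predictions. This is a hedged illustration, not the paper's procedure: the abstract only says that predictions are ensembled across diverse Random Forest rules, so majority voting, the prompt template, and the `stub_llm` stand-in below are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Sequence


def rule_consistency_vote(
    smiles: str,
    rules: Sequence[str],
    predict: Callable[[str], str],
) -> str:
    """Query the model once per verbalized rule and return the majority answer.

    `predict` is a hypothetical callable that maps a prompt to a label.
    Ties are broken in favor of the label that was seen first.
    """
    if not rules:
        raise ValueError("at least one rule is required")
    votes = [
        predict(f"Rule: {rule}\nSMILES: {smiles}\nAnswer:")  # assumed prompt format
        for rule in rules
    ]
    return Counter(votes).most_common(1)[0][0]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: simply echoes the label named in the rule."""
    return "positive" if "positive" in prompt else "negative"


# One verbalized rule per tree in the forest (hand-written examples).
rules = [
    "If the molecule has more than 1 aromatic ring group(s), then the predicted label is positive.",
    "If the molecule has at most 0 hydroxyl group(s), then the predicted label is negative.",
    "If the molecule has more than 0 amine group(s), then the predicted label is positive.",
]

result = rule_consistency_vote("CCO", rules, stub_llm)
print(result)
```

As with bagging, the intent is that errors tied to any single rule are averaged out; for regression-style properties a mean or median over the per-rule outputs would be the natural analogue.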

Problem

Research questions and friction points this paper is trying to address.

Molecular Property Prediction
Large Language Models
Generalist Models
Drug Discovery
ADMET

Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge distillation
large language models
molecular property prediction
decision trees
rule verbalization

Khiem Le
University of Notre Dame, IN, USA

Sreejata Dey
University of Notre Dame, IN, USA

Marcos Martínez Galindo
IBM Research

Vanessa Lopez
Knowledge Media Institute, IBM Research Europe
Semantic Web, Question Answering, City Data

Ting Hua
University of Notre Dame
Efficient learning, Compression, Reasoning

Nitesh V. Chawla
University of Notre Dame, IN, USA

Hoang Thanh Lam
Research staff, IBM Research, Dublin, Ireland
Data mining and machine learning