Bridging Molecular Graphs and Large Language Models

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) lack intrinsic mechanisms to process molecular graph structures, as their token-based architectures are not semantically aligned with discrete graph representations. Method: We propose Graph2Token, a graph–text alignment framework that requires no fine-tuning of the LLM backbone. It comprises: (i) a graph-structure encoder that maps molecular graphs into LLM-compatible tokens; (ii) a cross-source molecule–text paired dataset (ChEBI/HMDB) used to train the encoder, with IUPAC name-based prompts as additional context; and (iii) an alignment strategy that represents each graph token with the LLM's token vocabulary. Contribution/Results: On molecular classification and regression tasks under few-shot settings, Graph2Token achieves an average performance gain of 12.7% over state-of-the-art baselines, demonstrating that LLMs can understand molecular graphs without architectural modification or backbone fine-tuning.

📝 Abstract
While Large Language Models (LLMs) have shown exceptional generalization capabilities, their ability to process graph data, such as molecular structures, remains limited. To bridge this gap, this paper proposes Graph2Token, an efficient solution that aligns graph tokens to LLM tokens. The key idea is to represent a graph token with the LLM token vocabulary, without fine-tuning the LLM backbone. To achieve this goal, we first construct a molecule–text paired dataset from multiple sources, including ChEBI and HMDB, to train a graph structure encoder, which reduces the distance between graph and text representations in the feature space. Then, we propose a novel alignment strategy that associates a graph token with LLM tokens. To further unleash the potential of LLMs, we collect molecular IUPAC name identifiers and incorporate them into the LLM prompts. By aligning molecular graphs as special tokens, we activate the LLM's generalization ability for molecular few-shot learning. Extensive experiments on molecular classification and regression tasks demonstrate the effectiveness of the proposed Graph2Token.
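The core alignment idea, representing a graph token with the LLM's own token vocabulary, can be sketched as below. This is a minimal illustrative assumption of how such an alignment might work (a softmax-weighted mixture over a frozen vocabulary embedding matrix), not the paper's exact formulation; all shapes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 1000, 64
# Frozen LLM token-embedding matrix (stand-in for a real backbone's vocabulary).
llm_embeddings = rng.normal(size=(vocab_size, d_model))
# Output of a graph-structure encoder for one molecule (assumed shape).
graph_embedding = rng.normal(size=(d_model,))

def align_graph_token(g, E, temperature=1.0):
    """Express a graph embedding as a soft mixture of LLM token embeddings."""
    # Scaled similarity between the graph embedding and every vocabulary token.
    scores = E @ g / (np.sqrt(E.shape[1]) * temperature)
    # Softmax over the vocabulary (numerically stabilized).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Convex combination: the graph "token" now lives in the LLM embedding space.
    return weights @ E

aligned = align_graph_token(graph_embedding, llm_embeddings)
print(aligned.shape)  # (64,)
```

Because the result is a combination of existing vocabulary embeddings, it can be fed to the LLM as a special token without modifying or fine-tuning the backbone.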
Problem

Research questions and friction points this paper is trying to address.

Enabling LLMs to process molecular graph structures, which their token-based architectures do not natively support.
Aligning graph tokens with LLM tokens without fine-tuning the LLM backbone.
Improving few-shot molecular classification and regression with Graph2Token.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph2Token aligns graph tokens to LLM tokens
Graph structure encoder reduces graph-text distance
Molecular IUPAC names enhance LLM prompts
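The third point, incorporating IUPAC names into the prompt, might look like the following sketch. The template wording, the `<mol>` placeholder for the aligned graph token, and the example question are illustrative assumptions, not the paper's actual prompt.

```python
def build_prompt(iupac_name: str, question: str, graph_token: str = "<mol>") -> str:
    """Assemble an LLM prompt combining the graph token and the IUPAC name."""
    return (
        f"Molecule: {graph_token} (IUPAC name: {iupac_name})\n"
        f"Question: {question}\n"
        f"Answer:"
    )

prompt = build_prompt("2-acetyloxybenzoic acid", "Is this molecule toxic?")
print(prompt)
```

The IUPAC name gives the LLM a textual handle on the molecule that complements the aligned graph token, which is what lets the frozen backbone's generalization ability kick in.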