Multi-scale Graph Autoregressive Modeling: Molecular Property Prediction via Next Token Prediction

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations of existing molecular property prediction methods: SMILES sequence modeling neglects explicit topological information, while graph-native masking strategies risk disrupting critical chemical details such as those underlying activity cliffs. To overcome these challenges, the authors propose Connection-Aware Motif Sequencing (CamS), which transforms molecular graphs into multi-scale causal sequences by combining connection-aware motif mining with scaffold-rooted BFS serialization, preserving both structural integrity and chemically sensitive details. Pre-trained as a LLaMA-based autoregressive model, CamS-LLaMA achieves state-of-the-art performance on MoleculeNet and the MoleculeACE activity-cliff benchmark, outperforming SMILES-based language models and strong graph baselines. An interpretability analysis indicates that the multi-scale serialization draws attention toward cliff-determining differences.

📝 Abstract
We present Connection-Aware Motif Sequencing (CamS), a graph-to-sequence representation that enables decoder-only Transformers to learn molecular graphs via standard next-token prediction (NTP). For molecular property prediction, SMILES-based NTP scales well but lacks explicit topology, whereas graph-native masked modeling captures connectivity but risks disrupting pivotal chemical details (e.g., those behind activity cliffs). CamS bridges this gap by serializing molecular graphs into structure-rich causal sequences. CamS first mines data-driven connection-aware motifs. It then serializes motifs via scaffold-rooted breadth-first search (BFS) to establish a stable core-to-periphery order. Crucially, CamS enables hierarchical modeling by concatenating sequences from fine to coarse motif scales, allowing the model to condition global scaffolds on dense, uncorrupted local structural evidence. We instantiate CamS-LLaMA by pre-training a vanilla LLaMA backbone on CamS sequences. CamS-LLaMA achieves state-of-the-art performance on MoleculeNet and the activity-cliff benchmark MoleculeACE, outperforming both SMILES-based language models and strong graph baselines. Interpretability analysis confirms that our multi-scale causal serialization effectively drives attention toward cliff-determining differences.
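The serialization idea in the abstract (a BFS order rooted at the scaffold, then concatenation from fine to coarse scales) can be illustrated with a minimal sketch. This is not the authors' implementation: the motif names, the adjacency dictionaries, the `<scale>` separator token, and the alphabetical tie-break among neighbors are all assumptions made for illustration, and the motif mining step is not shown at all.

```python
from collections import deque


def bfs_serialize(adjacency, root):
    """Return motif labels in breadth-first order starting from `root`.

    Neighbors are visited in sorted order (an assumed tie-break) so the
    output is deterministic for a given graph and root.
    """
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in sorted(adjacency.get(node, ())):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order


def multiscale_sequence(scales, sep="<scale>"):
    """Concatenate per-scale BFS sequences, ordered fine -> coarse.

    `scales` is a list of (adjacency, root) pairs; `sep` is a hypothetical
    separator token placed between consecutive scales.
    """
    tokens = []
    for i, (adjacency, root) in enumerate(scales):
        if i > 0:
            tokens.append(sep)
        tokens.extend(bfs_serialize(adjacency, root))
    return tokens


# Toy, hand-written motif graphs for one molecule at two scales (hypothetical).
fine = {
    "ring_C6": ["C-N", "C-O"],
    "C-N": ["ring_C6", "CH3"],
    "C-O": ["ring_C6"],
    "CH3": ["C-N"],
}
coarse = {
    "benzene_core": ["amide_side"],
    "amide_side": ["benzene_core"],
}

sequence = multiscale_sequence([(fine, "ring_C6"), (coarse, "benzene_core")])
print(sequence)
```

Because the fine-scale tokens come first, a causal (left-to-right) model reading this sequence sees the local motifs before it predicts the coarse scaffold tokens, which mirrors the "condition global scaffolds on local structural evidence" description above.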
Problem

Research questions and friction points this paper is trying to address.

molecular property prediction
graph representation
activity cliffs
topology
chemical structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-scale graph modeling
connection-aware motif sequencing
next-token prediction
molecular property prediction
causal graph serialization
Zhuoyang Jiang
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Yaosen Min
Zhongguancun Institute of Artificial Intelligence
Computational Biology, Bioinformatics, Deep Learning
Peiran Jin
Zhongguancun Academy, Beijing, China
Lei Chen
Hong Kong University of Science and Technology
Human Powered Machine Learning, Databases, Data Mining