AtomDisc: An Atom-level Tokenizer that Boosts Molecular LLMs and Reveals Structure–Property Associations

📅 2025-11-28
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the challenge of adapting atom-level local environments in molecular graphs to the sequential, token-based processing of large language models (LLMs). The authors propose AtomDisc, a novel, structure-aware, atom-level tokenization framework. It quantizes each atom's local environment in the molecular graph, in a data-driven way, into a chemically meaningful discrete token that is embedded directly in the token space of a molecular LLM. Its core idea is that these tokens inject an interpretable inductive bias, giving the model explicit awareness of local structure. On molecular property prediction and molecular generation benchmarks, LLMs equipped with AtomDisc tokens achieve state-of-the-art performance. The tokens also distinguish chemically meaningful structural features, which reveals structure–property associations. The authors position AtomDisc as a step toward more interpretable, structure-grounded molecular LLMs.

📝 Abstract
Advances in large language models (LLMs) are accelerating discovery in molecular science. However, adapting molecular information to the serialized, token-based processing of LLMs remains a key challenge. Compared to other representations, molecular graphs explicitly encode atomic connectivity and local topological environments, which are key determinants of atomic behavior and molecular properties. Despite recent efforts to tokenize overall molecular topology, effective fine-grained tokenization of local atomic environments, which are critical for determining sophisticated chemical properties and reactivity, is still lacking. To address this gap, we introduce AtomDisc, a novel framework that quantizes atom-level local environments into structure-aware tokens embedded directly in the LLM's token space. Our experiments show that AtomDisc, in a data-driven way, can distinguish chemically meaningful structural features that reveal structure-property associations. Equipping LLMs with AtomDisc tokens injects an interpretable inductive bias that delivers state-of-the-art performance on property prediction and molecular generation. Our methodology and findings can pave the way for constructing more powerful molecular LLMs aimed at mechanistic insight and complex chemical reasoning.
Problem

Research questions and friction points this paper is trying to address.

Adapting molecular information to token-based LLM processing remains challenging
Effective fine-grained tokenization of local atomic environments, which determine chemical properties and reactivity, is lacking
Need to distinguish chemically meaningful structural features revealing structure-property associations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Atom-level tokenizer quantizes local atomic environments
Structure-aware tokens embed directly in LLM token space
Data-driven method reveals interpretable structure-property associations
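
The quantization idea in the points above can be illustrated with a toy sketch. This is not the AtomDisc implementation: the source does not specify the encoder or the quantizer, so the example assumes a hand-written neighbor-averaging descriptor and a nearest-neighbor lookup in a small codebook. The names `local_environment`, `quantize`, `tokenize`, the `<atom_k>` token format, and all numeric values are hypothetical, and the step of embedding the tokens in an LLM's token space is not shown.

```python
from math import dist

# Toy molecule: the heavy atoms of ethanol (C-C-O).
# Features are a hypothetical one-hot element encoding: [is_carbon, is_oxygen].
atom_features = [
    [1.0, 0.0],  # C (methyl)
    [1.0, 0.0],  # C (methylene)
    [0.0, 1.0],  # O (hydroxyl)
]
neighbors = {0: [1], 1: [0, 2], 2: [1]}  # adjacency list of the molecular graph


def local_environment(i, feats, nbrs):
    """Describe atom i by its own features plus the mean of its neighbors' features.

    Assumption: a stand-in for whatever graph encoder a real system would use.
    """
    own = feats[i]
    adjacent = nbrs[i]
    if not adjacent:  # isolated atom: no neighborhood information
        return own + [0.0] * len(own)
    mean = [sum(feats[j][k] for j in adjacent) / len(adjacent) for k in range(len(own))]
    return own + mean


def quantize(vec, codebook):
    """Return the index of the codebook entry closest to vec (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: dist(vec, codebook[k]))


# Hypothetical codebook; in a data-driven method these entries would be learned.
codebook = [
    [1.0, 0.0, 1.0, 0.0],  # carbon whose neighbors are all carbon
    [1.0, 0.0, 0.5, 0.5],  # carbon bonded to one carbon and one oxygen
    [0.0, 1.0, 1.0, 0.0],  # oxygen bonded to carbon
]


def tokenize(feats, nbrs, codebook):
    """Map every atom to a discrete token string based on its local environment."""
    return [
        f"<atom_{quantize(local_environment(i, feats, nbrs), codebook)}>"
        for i in range(len(feats))
    ]


tokens = tokenize(atom_features, neighbors, codebook)
print(tokens)  # ['<atom_0>', '<atom_1>', '<atom_2>']
```

Note that the two carbon atoms share the same element features but receive different tokens, because their neighborhoods differ. That is the sense in which a token here is tied to a local environment rather than to an element type.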
Authors
Mingxu Zhang
The Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.
Dazhong Shen
Nanjing University of Aeronautics and Astronautics
Data Mining, Generative AI
Ying Sun
The Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China.