From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

📅 2026-05-10

🤖 AI Summary
Existing chemical language models struggle to distinguish the stereochemical configurations of enantiomers reliably and often rely on superficial patterns in SMILES strings. This work presents Pan-CORE, a family of autoregressive Transformer encoder–decoder models for SMILES translation, and uses high-temporal-resolution checkpoint analysis of training to show that chiral-token accuracy rises abruptly after a long plateau. Analyses of attention dynamics, residual-stream trajectories, latent-space geometry, and encoder-decoder cross-evaluation indicate that the encoder predominantly drives the reconstruction of chiral representations, and targeted ablation identifies a small set of chirality-sensitive attention heads whose removal selectively reduces chiral-token accuracy. The abrupt transition is reproduced across all tested Pan-CORE variants, supporting the encoder's central role in acquiring chiral semantics and positioning SMILES translation as an experimental system for interpretable chemical representation learning.
📝 Abstract
Understanding how chemical language models (CLMs) learn chemical meaning from molecular string representations, rather than only surface-level string patterns, is an important question in chemical representation learning and machine learning for chemistry. Chirality provides a demanding test case: enantiomers can differ greatly in pharmacological activity and toxicity, yet CLMs often struggle to distinguish chiral configurations reliably. Here we present Pan-CORE (Pan-Chemical Omniscale Representation Engine), a family of autoregressive Transformer-based encoder-decoder models for SMILES translation, and use high-temporal-resolution checkpoint analysis to investigate how chiral information is learned during training. Across all tested Pan-CORE variants, we observe a reproducible jump-up in which chiral-token accuracy rises abruptly after a long plateau, suggesting that chiral learning stagnation is not explained by model capacity alone and instead reflects the complexity of chiral constraints. Analyses of attention dynamics, residual-stream trajectories, and latent-space geometry support an encoder-centered mechanism in which chiral-token representations undergo transient destabilization and reconstruction, seen as a V-shaped drop and recovery in vector norm and directional stability, together with a clear reorganization of chiral molecular representations in the latent space. Encoder-decoder cross-evaluation further supports the encoder-centered nature of the transition, and targeted attention-head ablation identifies a small set of chiral-sensitive heads whose removal selectively reduces chiral-token accuracy even in the fully trained model. These findings show that SMILES translation can serve as a useful experimental system for mechanistic analysis of semantic emergence in CLMs, with implications for interpretable chemical representation learning.
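The abstract tracks chiral-token accuracy separately from overall accuracy. The sketch below illustrates why that distinction matters. It rests on several assumptions that do not come from the paper: a simplified regex tokenizer, the rule that a chiral token is a bracket atom containing `@`, and position-by-position comparison of target and prediction. The function names `tokenize`, `token_accuracy`, and `chiral_token_accuracy` are hypothetical, and Pan-CORE's actual tokenization and metric may differ.

```python
import re

# Simplified SMILES tokenizer (an assumption, not the paper's tokenizer):
# bracket atoms first, then two-letter halogens, then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)


def is_chiral(token):
    """Treat a bracket atom carrying '@' or '@@' as a chiral token."""
    return token.startswith("[") and "@" in token


def token_accuracy(target, predicted):
    """Fraction of target positions whose predicted token matches."""
    t, p = tokenize(target), tokenize(predicted)
    correct = sum(1 for i, tok in enumerate(t) if i < len(p) and p[i] == tok)
    return correct / len(t)


def chiral_token_accuracy(target, predicted):
    """Same as token_accuracy, restricted to chiral positions of the target.

    Returns None when the target has no chiral token.
    """
    t, p = tokenize(target), tokenize(predicted)
    positions = [i for i, tok in enumerate(t) if is_chiral(tok)]
    if not positions:
        return None
    correct = sum(1 for i in positions if i < len(p) and p[i] == t[i])
    return correct / len(positions)


# The two enantiomers of alanine differ in a single token (@ versus @@).
target = "C[C@H](N)C(=O)O"
predicted = "C[C@@H](N)C(=O)O"

print(tokenize(target))
print(token_accuracy(target, predicted))         # 10 of 11 tokens match
print(chiral_token_accuracy(target, predicted))  # the single chiral token is wrong
```

In this toy case a prediction that inverts the stereocenter still gets 10 of 11 tokens right, so overall accuracy can look high while chiral-token accuracy is zero. This is the kind of gap that a separate chiral-token metric exposes during a long plateau.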
Problem

Research questions and friction points this paper is trying to address.

chirality
chemical language models
SMILES
semantic learning
enantiomers
Innovation

Methods, ideas, or system contributions that make the work stand out.

chirality
SMILES translation
chemical language models
encoder-centered learning
semantic emergence
Zehao Li
Peking University
Operations research, Stochastic approximation
Yasuhiro Yoshikai
Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, Japan
Shumpei Nemoto
Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, Japan
Hiroyuki Kusuhara
Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, Japan
Tadahaya Mizuno
Laboratory of Molecular Pharmacokinetics, Graduate School of Pharmaceutical Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, Japan; The Institute of Statistical Mathematics (ISM), Research Organization of Information and Systems, 190-8562 Tachikawa, Tokyo, Japan