🤖 AI Summary
Chemical language models (CLMs) often struggle to distinguish the chiral configurations of enantiomers reliably and may capture surface-level patterns in SMILES strings rather than chemical meaning. This work presents Pan-CORE, a family of autoregressive Transformer encoder-decoder models for SMILES translation, and uses high-temporal-resolution checkpoint analysis to show that chiral-token accuracy rises abruptly after a long plateau. Analyses of attention dynamics, residual-stream trajectories, and latent-space geometry, together with encoder-decoder cross-evaluation, point to an encoder-centered mechanism, and targeted ablation identifies a small set of chiral-sensitive attention heads whose removal selectively reduces chiral-token accuracy. The abrupt transition is reproduced across all tested Pan-CORE variants, which supports the encoder's central role in acquiring chiral semantics and has implications for interpretable chemical representation learning.
📝 Abstract
Understanding how chemical language models (CLMs) learn chemical meaning from molecular string representations, rather than only surface-level string patterns, is an important question in chemical representation learning and machine learning for chemistry. Chirality provides a demanding test case: enantiomers can differ greatly in pharmacological activity and toxicity, yet CLMs often struggle to distinguish chiral configurations reliably. Here we present Pan-CORE (Pan-Chemical Omniscale Representation Engine), a family of autoregressive Transformer-based encoder-decoder models for SMILES translation, and use high-temporal-resolution checkpoint analysis to investigate how chiral information is learned during training. Across all tested Pan-CORE variants, we observe a reproducible jump-up in which chiral-token accuracy rises abruptly after a long plateau, suggesting that chiral learning stagnation is not explained by model capacity alone and instead reflects the complexity of chiral constraints. Analyses of attention dynamics, residual-stream trajectories, and latent-space geometry support an encoder-centered mechanism in which chiral-token representations undergo transient destabilization and reconstruction, seen as a V-shaped drop and recovery in vector norm and directional stability, together with a clear reorganization of chiral molecular representations in the latent space. Encoder-decoder cross-evaluation further supports the encoder-centered nature of the transition, and targeted attention-head ablation identifies a small set of chiral-sensitive heads whose removal selectively reduces chiral-token accuracy even in the fully trained model. These findings show that SMILES translation can serve as a useful experimental system for mechanistic analysis of semantic emergence in CLMs, with implications for interpretable chemical representation learning.
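As a rough illustration of the chiral-token accuracy tracked across checkpoints, the sketch below scores only the positions whose target token is a chirality mark (`@` or `@@`). The regex tokenizer, the position-aligned comparison, and the names `tokenize` and `chiral_token_accuracy` are assumptions made for this example; the abstract does not specify the tokenization or the exact metric definition.

```python
import re

# Simplified SMILES tokenizer (assumption): the chirality marks "@" and "@@"
# are emitted as separate tokens; "@@" is listed first so it is not split.
TOKEN_RE = re.compile(r"@@|@|Cl|Br|%\d{2}|[A-Za-z]|\d|[\[\]()=#+\-/\\.:]")

CHIRAL_TOKENS = {"@", "@@"}


def tokenize(smiles):
    """Split a SMILES string into tokens with the simplified pattern above."""
    return TOKEN_RE.findall(smiles)


def chiral_token_accuracy(targets, predictions):
    """Fraction of chiral positions in the targets that are predicted exactly.

    Assumes target and prediction are position-aligned, as they would be
    under teacher-forced decoding (an assumption of this sketch).
    """
    correct = 0
    total = 0
    for tgt, pred in zip(targets, predictions):
        tgt_tokens = tokenize(tgt)
        pred_tokens = tokenize(pred)
        for i, tok in enumerate(tgt_tokens):
            if tok in CHIRAL_TOKENS:
                total += 1
                if i < len(pred_tokens) and pred_tokens[i] == tok:
                    correct += 1
    # No chiral positions at all: the metric is undefined.
    return correct / total if total else float("nan")


# The two alanine enantiomers as targets; the second prediction has the
# wrong chirality mark.
targets = ["N[C@@H](C)C(=O)O", "N[C@H](C)C(=O)O"]
predictions = ["N[C@@H](C)C(=O)O", "N[C@@H](C)C(=O)O"]
accuracy = chiral_token_accuracy(targets, predictions)
```

Evaluating such a metric at every saved checkpoint, alongside the overall token accuracy, is one way a plateau followed by an abrupt rise in chiral-token accuracy could be made visible.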