🤖 AI Summary
Existing molecular foundation models rely on closed-vocabulary SMILES tokenizers whose limited coverage prevents them from representing much of chemical space. To address this, we propose two new tokenizers, Smirk and Smirk-GPE, that fully cover the OpenSMILES specification. We systematically evaluate 30 tokenizers, 19 of them chemistry-specific, for their coverage of SMILES, and introduce n-gram language models as a low-cost proxy for assessing the impact of tokenizer choice. We validate this proxy by training and fine-tuning 18 RoBERTa-style encoders for molecular property prediction. The results highlight the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics. The proposed tokenizer framework integrates nuclear, electronic, and geometric degrees of freedom, which facilitates applications in pharmacology, agriculture, biology, and energy storage.
📝 Abstract
Text-based foundation models have become an important part of scientific discovery, with molecular foundation models accelerating advancements in molecular design and materials science. However, existing models are constrained by closed-vocabulary tokenizers that capture only a fraction of molecular space. In this work, we systematically evaluate 30 tokenizers, including 19 chemistry-specific ones, for their coverage of the SMILES molecular representation language, revealing significant gaps. To assess the impact of tokenizer choice, we introduce n-gram language models as a low-cost proxy and validate their effectiveness by training and fine-tuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, we propose two new tokenizers, Smirk and Smirk-GPE, with full coverage of the OpenSMILES specification. Our results highlight the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics. The proposed tokenizer framework systematically integrates nuclear, electronic, and geometric degrees of freedom; this facilitates applications in pharmacology, agriculture, biology, and energy storage.
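To make the coverage gap concrete, the following is a minimal Python sketch, not the Smirk implementation. It contrasts a closed-vocabulary atom-wise tokenizer, which treats each bracketed atom as one token, with a glyph-level tokenizer that splits bracketed atoms into their constituent symbols. Both regular expressions, the function names (`atomwise`, `glyphwise`, `unknown_fraction`), and the toy train/test molecules are assumptions chosen for illustration; the patterns cover only a small subset of OpenSMILES.

```python
import re

# Illustrative atom-wise pattern (an assumption, not taken from the paper):
# a whole bracketed atom such as "[C@@H]" is emitted as ONE token.
ATOMWISE = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#$:/\\+\-().~*]"
)

# Illustrative glyph-level pattern for the inside of a bracketed atom:
# brackets, element symbols, digits, chirality marks, and charge signs
# each become their own token.
GLYPH = re.compile(r"\[|\]|se|as|[A-Z][a-z]?|[a-z]|\d|[@+\-:*]")


def atomwise(smiles):
    """Tokenize with bracketed atoms kept whole."""
    return ATOMWISE.findall(smiles)


def glyphwise(smiles):
    """Tokenize atom-wise, then split every bracketed atom into glyphs."""
    tokens = []
    for tok in atomwise(smiles):
        if tok.startswith("["):
            tokens.extend(GLYPH.findall(tok))
        else:
            tokens.append(tok)
    return tokens


def unknown_fraction(tokenize, train, test):
    """Fraction of test tokens that never occur in the training vocabulary."""
    vocab = {t for s in train for t in tokenize(s)}
    test_tokens = [t for s in test for t in tokenize(s)]
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)


# Toy corpora (hypothetical): the test set reuses familiar glyphs in
# bracketed atoms that the training set never contained as whole tokens.
train = ["CCO", "C[C@H](N)C(=O)O", "[Na+].[Cl-]", "[13CH4]"]
test = ["C[C@@H](N)C(=O)[O-]", "[13CH3]O"]

print("atom-wise unknown fraction: ", unknown_fraction(atomwise, train, test))
print("glyph-level unknown fraction:", unknown_fraction(glyphwise, train, test))
```

On these toy inputs the atom-wise tokenizer cannot represent `[C@@H]`, `[O-]`, or `[13CH3]`, because each is a new vocabulary entry, whereas the glyph-level tokenizer composes all three from symbols it has already seen. This is the general motivation for open-vocabulary tokenization; the actual design of Smirk and Smirk-GPE is described in the paper itself.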