🤖 AI Summary
Natural products (NPs) play a pivotal role in drug discovery, yet general-purpose chemical language models struggle to effectively capture their structural complexity. This work presents the first systematic development and evaluation of NP-specific chemical language models (NPCLMs), pretraining Mamba, Mamba-2, and GPT architectures on approximately one million NPs using eight tokenization strategies, including character-level, Atom-in-SMILES, BPE, and an NP-customized BPE variant. Experimental results demonstrate that Mamba outperforms Mamba-2 and GPT by 1–2% in validity and uniqueness of generated molecules and exhibits fewer long-range dependency errors. Furthermore, on membrane permeability and anticancer activity prediction tasks, Mamba achieves Matthews Correlation Coefficient (MCC) scores 0.02–0.04 higher than GPT. Notably, with only ~1M NP training samples, these NPCLMs match or exceed the performance of general-purpose models trained on datasets two orders of magnitude larger, underscoring the efficacy of small-scale, domain-specific pretraining.
📝 Abstract
Language models are widely used in chemistry for molecular property prediction and small-molecule generation, yet Natural Products (NPs) remain underexplored despite their importance in drug discovery. To address this gap, we develop NP-specific chemical language models (NPCLMs) by pre-training state-space models (Mamba and Mamba-2) and comparing them with transformer baselines (GPT). Using a dataset of about 1M NPs, we present the first systematic comparison of selective state-space models and transformers for NP-focused tasks, together with eight tokenization strategies including character-level, Atom-in-SMILES (AIS), byte-pair encoding (BPE), and NP-specific BPE. We evaluate molecule generation by validity, uniqueness, and novelty, and property prediction (membrane permeability, taste, anticancer activity) by Matthews Correlation Coefficient (MCC) and AUC-ROC. Mamba generates 1–2% more valid and unique molecules than Mamba-2 and GPT, with fewer long-range dependency errors, while GPT yields slightly more novel structures. For property prediction, Mamba variants outperform GPT by 0.02–0.04 MCC under random splits, while scaffold splits show comparable performance. Results demonstrate that domain-specific pre-training on about 1M NPs can match models trained on datasets over 100 times larger.
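To make the tokenization comparison concrete, the sketch below contrasts character-level tokenization of SMILES with a regex-based atom-level tokenizer, which keeps multi-character atoms like `Cl` and `Br` intact. The regex is a common pattern from the SMILES-tokenization literature and is an illustrative assumption here, not the paper's exact tokenizer (AIS and the NP-customized BPE variants are more elaborate than this).

```python
import re

# Atom-level SMILES tokenization regex (a widely used pattern; assumed here
# for illustration, not taken from the paper's implementation).
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def char_tokenize(smiles: str) -> list[str]:
    """Character-level tokenization: every character becomes a token."""
    return list(smiles)

def atom_tokenize(smiles: str) -> list[str]:
    """Atom-level tokenization: two-letter atoms (Cl, Br) and bracketed
    atoms ([nH], [C@@H]) stay intact as single tokens."""
    return SMILES_TOKEN_RE.findall(smiles)

smiles = "ClCCBr"  # 1-bromo-2-chloroethane
print(char_tokenize(smiles))  # ['C', 'l', 'C', 'C', 'B', 'r']
print(atom_tokenize(smiles))  # ['Cl', 'C', 'C', 'Br']
```

The difference matters for a language model: a character-level vocabulary forces the model to learn that `C` + `l` forms one atom, whereas atom-level (and BPE-merged) vocabularies encode such units directly, shortening sequences and reducing the long-range dependencies the model must track.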