SMolLM: Small Language Models Learn Small Molecular Grammar

📅 2026-05-07

🤖 AI Summary

Molecular language models are often parameter-heavy, yet how they learn chemical grammar remains poorly understood. This work proposes SMolLM, a weight-shared Transformer with only 53K parameters that generates novel SMILES strings with 95% validity on ZINC-250K, outperforming a standard GPT with ten times as many parameters. Interpretability analyses (error classification, linear probing, and sparse autoencoders) indicate that the same block resolves syntactic constraints across passes in a fixed order: brackets first, then ring closures, and finally valence. An ablation across attention heads and passes further localizes the first bracket-matching step to a single attention head, making the model a compact testbed for studying iterative computation in formal-language domains.

📝 Abstract

Language models for molecular design have scaled to hundreds of millions of parameters, yet how they learn chemical grammar is poorly understood. We train SMolLM, a 53K-parameter weight-shared transformer, to generate novel SMILES with 95% validity on the ZINC-250K drug-like-molecule benchmark, outperforming a standard GPT with 10 times more parameters. Mechanistically, the same block resolves SMILES constraints across passes in a fixed order: brackets first, rings second, and valence last, as shown by error classification, linear probing, and sparse autoencoders. A systematic ablation across attention heads and passes further localizes the first bracket-matching step to a single attention head. Together, these results yield a compact, mechanistically interpretable molecular generator and a testbed for studying iterative computation in formal-language domains.
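
To make the error categories in the abstract concrete, the following is a minimal sketch of how generated SMILES strings might be sorted into bracket and ring-closure errors. It is not the paper's classifier: the function name `classify_smiles_syntax` and the error labels are hypothetical, and "brackets" is assumed here to mean branch parentheses. Valence errors are not covered, since detecting them needs a chemistry toolkit (for example, RDKit's `Chem.MolFromSmiles` returns `None` for strings it cannot parse or sanitize).

```python
def classify_smiles_syntax(smiles: str) -> list[str]:
    """Return sorted, coarse syntax-error labels for a SMILES string.

    Hypothetical helper for illustration only. It checks branch
    parentheses and ring-closure labels; it does NOT check valence.
    """
    digits = "0123456789"
    errors = set()
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure labels opened but not yet closed
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms such as [13CH4] or [NH4+]: digits inside
            # them are isotopes, hydrogen counts, or charges, not ring labels.
            end = smiles.find("]", i)
            if end == -1:
                errors.add("unclosed_atom_bracket")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis with no matching opener.
                errors.add("unbalanced_parenthesis")
                depth = 0
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12.
            label = smiles[i + 1:i + 3]
            if len(label) == 2 and all(c in digits for c in label):
                open_rings ^= {int(label)}  # toggle: open or close the ring
                i += 3
                continue
            errors.add("malformed_ring_label")
        elif ch in digits:
            open_rings ^= {int(ch)}  # single-digit ring-closure label
        i += 1
    if depth > 0:
        errors.add("unbalanced_parenthesis")
    if open_rings:
        errors.add("unclosed_ring")
    return sorted(errors)


if __name__ == "__main__":
    for s in ["c1ccccc1", "CC(C", "C1CCCC", "C1CC(C"]:
        print(s, classify_smiles_syntax(s))
```

Applied to a batch of model samples, a checker of this kind would give per-category error counts. Tracking those counts across passes is one plausible way to observe the bracket, ring, valence ordering the abstract describes, although the paper's exact procedure is not given here.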

Problem

Research questions and friction points this paper is trying to address.

molecular design

chemical grammar

language models

SMILES

interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

small language models

molecular grammar

SMILES generation