🤖 AI Summary
Molecular property prediction in drug discovery relies heavily on high-quality representations, yet existing pretraining methods lack domain-specific adaptability. To address this, the paper proposes a SMILES-based, domain-specialized text-to-text pretraining paradigm. Building on the T5 architecture, the authors design a joint pretraining objective that combines masked language modeling with molecular semantic awareness, producing fixed-length embeddings that substantially reduce computational overhead for downstream classification. Evaluated on six standard molecular property prediction benchmarks, the method consistently outperforms conventional likelihood-based pretraining and standard fine-tuning. Notably, feeding frozen embeddings directly into a lightweight classifier matches end-to-end fine-tuning while delivering a 3.2× inference speedup and a 67% memory reduction. The core contribution is the first generative pretraining framework explicitly tailored to the linguistic characteristics of molecular SMILES strings, balancing representation quality, generalization, and deployment efficiency.
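The frozen-embedding workflow above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding matrix here is a random stand-in for the fixed-length vectors the frozen pretrained encoder would emit (one per SMILES string), and the "lightweight classifier" is a plain logistic regression trained by batch gradient descent in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen encoder outputs: n molecules, d-dimensional embeddings.
# In the paper's setting these would come from the pretrained T5 encoder and
# stay fixed; only the small classifier below is trained.
n, d = 200, 32
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w > 0).astype(float)   # synthetic binary property labels

# Lightweight head: logistic regression via batch gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid probabilities
    w -= lr * (X.T @ (p - y) / n)            # gradient of mean log-loss
    b -= lr * float(np.mean(p - y))

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = float(np.mean((p > 0.5) == (y == 1.0)))
print(f"train accuracy: {acc:.2f}")
```

Because the encoder is never touched, each downstream task only pays for this small head, which is where the reported speedup and memory savings come from.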
📝 Abstract
Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks learn molecular properties using graph-based, language-based, or feature-based methods. Recent advances in natural language processing have highlighted the capability of neural networks to learn complex human language using masked language modelling. These approaches to training large transformer-based deep learning models have also been used to learn the language of molecules, as represented by simplified molecular-input line-entry system (SMILES) strings. Here, we present novel domain-specific text-to-text pretraining tasks that yield improved performance on six classification-based molecular property prediction benchmarks, relative to both traditional likelihood-based training and previously proposed fine-tuning tasks. Through ablation studies, we show that these domain-specific pretraining tasks improve data and computational efficiency. Finally, the pretrained embeddings from the model can be used as fixed inputs to a downstream machine learning classifier, yielding performance comparable to fine-tuning at much lower computational overhead.
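To make the text-to-text pretraining format concrete, the sketch below builds a T5-style span-corruption example from a SMILES string. The masked span positions are chosen by hand for illustration (the paper's actual masking scheme and any chemistry-aware span selection are not specified here); the `<extra_id_n>` sentinel convention follows standard T5 span corruption.

```python
def make_span_corruption_pair(smiles, spans):
    """Build a T5-style span-corruption (input, target) pair from a SMILES string.

    `spans` is a list of non-overlapping (start, end) character indices to mask.
    Each masked span is replaced in the input by a sentinel token; the target
    lists each sentinel followed by the text it replaced, closed by a final
    sentinel, mirroring T5's text-to-text pretraining format.
    """
    inp, tgt, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inp.append(smiles[cursor:start])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.append(smiles[start:end])
        cursor = end
    inp.append(smiles[cursor:])
    tgt.append(f"<extra_id_{len(spans)}>")
    return "".join(inp), "".join(tgt)

# Aspirin, with two character spans masked for illustration.
smiles = "CC(=O)Oc1ccccc1C(=O)O"
inp, tgt = make_span_corruption_pair(smiles, [(2, 6), (9, 13)])
print(inp)  # CC<extra_id_0>Oc1<extra_id_1>c1C(=O)O
print(tgt)  # <extra_id_0>(=O)<extra_id_1>cccc<extra_id_2>
```

Training the model to reconstruct the masked fragments forces it to learn SMILES grammar (balanced parentheses, ring-closure digits, valence-consistent atom sequences), which is the domain-specific signal the pretraining tasks exploit.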