SmilesT5: Domain-specific pretraining for molecular language models

📅 2025-07-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge in drug discovery where molecular property prediction relies heavily on high-quality representations but existing pretraining methods lack domain-specific adaptability, this paper proposes a SMILES-based, domain-specialized text-to-text pretraining paradigm. Leveraging the T5 architecture, we design a joint pretraining objective combining masked language modeling with molecular semantic awareness, enabling fixed-length embedding outputs that substantially reduce computational overhead for downstream classification tasks. Evaluated on six mainstream molecular property prediction benchmarks, our method consistently outperforms conventional likelihood-based pretraining and standard fine-tuning. Notably, when frozen embeddings are fed directly into a lightweight classifier, performance matches end-to-end fine-tuning while achieving a 3.2× inference speedup and 67% memory reduction. Our core contribution is the first generative pretraining framework explicitly tailored to the linguistic characteristics of molecular SMILES strings—balancing representation quality, generalization capability, and deployment efficiency.
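The frozen-embedding workflow described above can be illustrated with a minimal sketch. It assumes a Hugging Face-style T5 encoder checkpoint (the name `smiles-t5-base` is a placeholder, not a released model from this paper) and uses mean pooling over encoder states plus a scikit-learn logistic regression as a stand-in for the "lightweight classifier"; the paper's actual pooling strategy and classifier may differ.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel
from sklearn.linear_model import LogisticRegression

# Placeholder checkpoint name; substitute whichever pretrained SMILES-T5 weights are available.
CHECKPOINT = "smiles-t5-base"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
encoder = T5EncoderModel.from_pretrained(CHECKPOINT).eval()

def embed(smiles_batch):
    """Mean-pool the frozen encoder's last hidden states into fixed-length vectors."""
    tokens = tokenizer(smiles_batch, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**tokens).last_hidden_state      # (batch, seq_len, dim)
    mask = tokens.attention_mask.unsqueeze(-1)             # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)            # (batch, dim)

# train_smiles / train_labels are placeholders for a benchmark's training split.
# The encoder receives no gradient updates; only the classifier is trained.
X_train = embed(train_smiles).numpy()
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
```

Because the encoder is called once per molecule and never updated, the only trainable component is the small classifier head, which is where the reported inference and memory savings come from.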

📝 Abstract
Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks can learn molecular properties using graph-based, language-based or feature-based methods. Recent advances in natural language processing have highlighted the capabilities of neural networks to learn complex human language using masked language modelling. These approaches to training large transformer-based deep learning models have also been used to learn the language of molecules, as represented by simplified molecular-input line-entry system (SMILES) strings. Here, we present novel domain-specific text-to-text pretraining tasks that yield improved performance on six classification-based molecular property prediction benchmarks, relative to both traditional likelihood-based training and previously proposed fine-tuning tasks. Through ablation studies, we show that data and computational efficiency can be improved by using these domain-specific pretraining tasks. Finally, the pretrained embeddings from the model can be used as fixed inputs into a downstream machine learning classifier and yield comparable performance to fine-tuning but with much lower computational overhead.
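For context on the text-to-text setup the abstract refers to, the sketch below shows generic T5-style span corruption applied to a SMILES string. This is the standard baseline objective, not the paper's domain-specific tasks, and the character-level tokenization and sentinel format are illustrative assumptions.

```python
import random

def span_corrupt(smiles: str, mask_rate: float = 0.15, mean_span: int = 3):
    """Generic T5-style span corruption on a SMILES string.

    Random character spans are replaced with sentinel tokens in the encoder
    input; the decoder target lists each sentinel followed by the dropped span.
    """
    chars = list(smiles)
    n_to_mask = max(1, int(len(chars) * mask_rate))
    inp, tgt, i, sentinel = [], [], 0, 0
    while i < len(chars):
        if n_to_mask > 0 and random.random() < mask_rate:
            span = min(mean_span, len(chars) - i, n_to_mask)
            inp.append(f"<extra_id_{sentinel}>")
            tgt.append(f"<extra_id_{sentinel}>" + "".join(chars[i:i + span]))
            sentinel += 1
            n_to_mask -= span
            i += span
        else:
            inp.append(chars[i])
            i += 1
    return "".join(inp), "".join(tgt) + f"<extra_id_{sentinel}>"

# One possible corruption of "CC(=O)Oc1ccccc1C(=O)O" (aspirin):
#   input : "CC(=O)O<extra_id_0>cccc1C(=O)O"
#   target: "<extra_id_0>c1c<extra_id_1>"
```

The paper's contribution lies in replacing or augmenting this generic objective with pretraining tasks tailored to SMILES syntax and molecular semantics.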
Problem

Research questions and friction points this paper is trying to address.

Improving molecular property prediction via domain-specific pretraining
Enhancing data and computational efficiency in SMILES-based models
Enabling fixed embeddings for low-overhead downstream classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-specific pretraining for molecular language models
Novel text-to-text pretraining tasks for SMILES
Pretrained embeddings reduce computational overhead
Philip Spence
Department of Biochemistry and Metabolism, John Innes Centre, Norwich, UK; HotHouse Therapeutics, Centrum, Norwich Research Park, Norwich, UK
Brooks Paige
Associate Professor, University College London
Machine Learning, Statistics
Anne Osbourn
Department of Biochemistry and Metabolism, John Innes Centre, Norwich, UK