🤖 AI Summary
Molecular property prediction in drug discovery relies heavily on high-quality representations, yet existing pretraining methods lack domain-specific adaptability. To address this, the paper proposes a SMILES-based, domain-specialized text-to-text pretraining paradigm. Building on the T5 architecture, the authors design a joint pretraining objective that combines masked language modeling with molecular semantic awareness, producing fixed-length embeddings that substantially reduce computational overhead for downstream classification. Evaluated on six standard molecular property prediction benchmarks, the method consistently outperforms conventional likelihood-based pretraining and standard fine-tuning. Notably, feeding frozen embeddings directly into a lightweight classifier matches end-to-end fine-tuning while delivering a 3.2× inference speedup and a 67% memory reduction. The core contribution is the first generative pretraining framework explicitly tailored to the linguistic characteristics of molecular SMILES strings, balancing representation quality, generalization, and deployment efficiency.
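The frozen-embedding workflow above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding matrix here is a random stand-in for the fixed-length vectors the frozen pretrained encoder would emit (one per SMILES string), and the "lightweight classifier" is a plain logistic regression trained by batch gradient descent in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen encoder outputs: n molecules, d-dimensional embeddings.
# In the paper's setting these would come from the pretrained T5 encoder and
# stay fixed; only the small classifier below is trained.
n, d = 200, 32
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w > 0).astype(float)   # synthetic binary property labels

# Lightweight head: logistic regression via batch gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid probabilities
    w -= lr * (X.T @ (p - y) / n)            # gradient of mean log-loss
    b -= lr * float(np.mean(p - y))

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = float(np.mean((p > 0.5) == (y == 1.0)))
print(f"train accuracy: {acc:.2f}")
```

Because the encoder is never touched, each downstream task only pays for this small head, which is where the reported speedup and memory savings come from.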
📝 Abstract
Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks learn molecular properties using graph-based, language-based, or feature-based methods. Recent advances in natural language processing have highlighted the capability of neural networks to learn complex human language using masked language modelling. These approaches to training large transformer-based deep learning models have also been used to learn the language of molecules, as represented by simplified molecular-input line-entry system (SMILES) strings. Here, we present novel domain-specific text-to-text pretraining tasks that yield improved performance on six classification-based molecular property prediction benchmarks, relative to both traditional likelihood-based training and previously proposed fine-tuning tasks. Through ablation studies, we show that these domain-specific pretraining tasks improve data and computational efficiency. Finally, the pretrained embeddings from the model can be used as fixed inputs to a downstream machine learning classifier, yielding performance comparable to fine-tuning at much lower computational overhead.
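To make the text-to-text pretraining format concrete, the sketch below builds a T5-style span-corruption example from a SMILES string. The masked span positions are chosen by hand for illustration (the paper's actual masking scheme and any chemistry-aware span selection are not specified here); the `<extra_id_n>` sentinel convention follows standard T5 span corruption.

```python
def make_span_corruption_pair(smiles, spans):
    """Build a T5-style span-corruption (input, target) pair from a SMILES string.

    `spans` is a list of non-overlapping (start, end) character indices to mask.
    Each masked span is replaced in the input by a sentinel token; the target
    lists each sentinel followed by the text it replaced, closed by a final
    sentinel, mirroring T5's text-to-text pretraining format.
    """
    inp, tgt, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inp.append(smiles[cursor:start])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.append(smiles[start:end])
        cursor = end
    inp.append(smiles[cursor:])
    tgt.append(f"<extra_id_{len(spans)}>")
    return "".join(inp), "".join(tgt)

# Aspirin, with two character spans masked for illustration.
smiles = "CC(=O)Oc1ccccc1C(=O)O"
inp, tgt = make_span_corruption_pair(smiles, [(2, 6), (9, 13)])
print(inp)  # CC<extra_id_0>Oc1<extra_id_1>c1C(=O)O
print(tgt)  # <extra_id_0>(=O)<extra_id_1>cccc<extra_id_2>
```

Training the model to reconstruct the masked fragments forces it to learn SMILES grammar (balanced parentheses, ring-closure digits, valence-consistent atom sequences), which is the domain-specific signal the pretraining tasks exploit.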