NovoMolGen: Rethinking Molecular Language Model Pretraining

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Molecular language models suffer from unclear pretraining mechanisms and weak correlations between textual representation quality and generative performance. Method: This work introduces the NovoMolGen Transformer family, pretrained on 1.5 billion molecules, to systematically investigate how SMILES representation, subword tokenization, model scale, and dataset size affect de novo molecular generation. Contribution/Results: We find that conventional pretraining metrics such as reconstruction loss correlate only weakly with downstream generation performance, revealing a fundamental divergence between molecular and natural language pretraining dynamics. Leveraging large-scale self-supervised pretraining and a standardized evaluation framework, NovoMolGen achieves state-of-the-art performance on both unconstrained generation and goal-directed tasks, including property optimization and conditional generation, significantly outperforming existing molecular foundation models and task-specific generators.
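The summary's emphasis on SMILES representation and subword tokenization can be illustrated with a small example. The sketch below shows a common regex-based atom-level tokenization of a SMILES string, a typical preprocessing step in Mol-LLM pipelines; the regex and function names are illustrative assumptions, not NovoMolGen's actual tokenizer.

```python
import re

# Regex-based atom-level SMILES tokenizer (a minimal sketch, not NovoMolGen's
# exact preprocessing). Bracket atoms, two-letter halogens, bonds, branches,
# and ring-closure digits each become single tokens.
SMILES_ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|N|O|S|P|F|I|B|C|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def atom_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom- and bond-level tokens."""
    return SMILES_ATOM_PATTERN.findall(smiles)

print(atom_tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```

Subword tokenizers (e.g., BPE trained on a SMILES corpus) would instead merge frequent token sequences such as "c1ccccc1" into single vocabulary entries, which is one of the design choices the paper compares.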

📝 Abstract
Designing de-novo molecules with desired property profiles requires efficient exploration of the vast chemical space ranging from $10^{23}$ to $10^{60}$ possible synthesizable candidates. While various deep generative models have been developed to design small molecules using diverse input representations, Molecular Large Language Models (Mol-LLMs) based on string representations have emerged as a scalable approach capable of exploring billions of molecules. However, there remains limited understanding regarding how standard language modeling practices such as textual representations, tokenization strategies, model size, and dataset scale impact molecular generation performance. In this work, we systematically investigate these critical aspects by introducing NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for de-novo molecule generation. Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance, revealing important distinctions between molecular and general NLP training dynamics. NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks, thus providing a robust foundation for advancing efficient and effective molecular modeling strategies.
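The abstract's unconstrained-generation setting is typically scored with validity, uniqueness, and novelty. As a minimal sketch (using RDKit, and not necessarily the paper's exact evaluation code), the function below computes these three metrics for a batch of generated SMILES against a reference training set; function and variable names are assumptions for illustration.

```python
from rdkit import Chem

def generation_metrics(generated: list[str], train_set: set[str]) -> dict:
    """Validity, uniqueness, and novelty for a batch of generated SMILES."""
    mols = [Chem.MolFromSmiles(s) for s in generated]
    # Keep only parseable molecules and canonicalize them for fair comparison.
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]
    unique = set(valid)
    novel = unique - train_set
    return {
        "validity": len(valid) / max(len(generated), 1),
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),
    }

# Toy usage: one invalid string, one duplicate, one molecule already in the training set.
print(generation_metrics(["CCO", "CCO", "c1ccccc1", "not_a_smiles"], {"CCO"}))
# {'validity': 0.75, 'uniqueness': 0.666..., 'novelty': 0.5}
```

Goal-directed tasks add a property oracle (e.g., QED or a docking score) on top of these sanity checks and optimize the generator toward it.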
Problem

Research questions and friction points this paper is trying to address.

Investigating the impact of standard language modeling practices on molecular generation performance
Understanding the weak correlation between pretraining metrics and downstream generation results
Advancing efficient exploration of the vast chemical space for de-novo molecule design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based foundation models pretrained on 1.5B molecules
Systematic study of SMILES representations, tokenization strategies, model scale, and dataset size
New state-of-the-art results in both unconstrained and goal-directed molecular generation
👥 Authors

Kamran Chitsaz
Chandar Research Lab, Mila – Quebec AI Institute

Roshan Balaji
BioSystems Engineering and Control Lab, Wadhwani School of Data Science and AI, IIT Madras, The Centre for Integrative Biology and Systems medicinE (IBSE)

Quentin Fournier
Research Fellow at Mila – Quebec AI Institute
Deep Learning, Natural Language Processing, Drug Discovery

Nirav Pravinbhai Bhatt
BioSystems Engineering and Control Lab, Wadhwani School of Data Science and AI, IIT Madras, The Centre for Integrative Biology and Systems medicinE (IBSE), IIT Madras Zanzibar

Sarath Chandar
Associate Professor @ Polytechnique Montreal. Mila. Canada CIFAR AI Chair. Canada Research Chair.
Artificial Intelligence, Machine Learning, Deep Learning, Reinforcement Learning, NLP