NatureLM: Deciphering the Language of Nature for Scientific Discovery

📅 2025-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing scientific foundation models are typically trained in isolation on single domains, limiting unified representation learning and cross-domain collaboration. To address this, we propose NatureLM, the first cross-scientific-domain sequence foundation model, which unifies natural entities (small molecules, proteins, RNA, and materials) into a shared "language of nature" representation space, enabling semantic alignment and cross-modal generation. NatureLM introduces a multi-domain joint self-supervised pretraining paradigm, integrating SMILES, FASTA, RNA sequences, and crystallographic data within a Transformer architecture trained at up to 46.7B parameters. The model supports text-guided generation, cross-domain design (e.g., protein-to-molecule), and multi-task scientific reasoning. It achieves state-of-the-art performance on SMILES-to-IUPAC translation and USPTO-50k retrosynthesis prediction, and demonstrates practical efficacy in end-to-end drug discovery, novel material design, and therapeutic protein/RNA generation.
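The unified sequence space described above can be pictured with a toy sketch: each entity, whatever its domain, is serialized as text and wrapped in a domain marker before entering one shared token stream. The tag names and helper below are illustrative assumptions for this summary, not NatureLM's actual tokenizer or vocabulary.

```python
# Toy illustration of a shared "language of nature" sequence space.
# The domain tags below are hypothetical, not NatureLM's real special tokens.

DOMAIN_TAGS = {
    "molecule": ("<mol>", "</mol>"),            # SMILES strings
    "protein": ("<protein>", "</protein>"),     # FASTA amino-acid sequences
    "rna": ("<rna>", "</rna>"),                 # nucleotide sequences
    "material": ("<material>", "</material>"),  # serialized crystal data
}

def to_unified_sequence(domain: str, entity: str) -> str:
    """Wrap a domain-specific string so all domains share one text stream."""
    open_tag, close_tag = DOMAIN_TAGS[domain]
    return f"{open_tag}{entity}{close_tag}"

# Entities from different domains become interchangeable training text:
corpus = [
    to_unified_sequence("molecule", "CC(=O)Oc1ccccc1C(=O)O"),  # aspirin SMILES
    to_unified_sequence("protein", "MKTAYIAKQR"),
    to_unified_sequence("rna", "AUGGCCUUA"),
]
```

Once every domain lives in the same token stream, a single Transformer can be pretrained jointly on all of them, which is what enables the cross-modal generation described above.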

📝 Abstract
Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, and RNA. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (briefly, NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) achieving state-of-the-art performance in tasks like SMILES-to-IUPAC translation and retrosynthesis on USPTO-50k. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.
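Applications (i) and (ii) amount to conditioning sequence generation on a text instruction plus zero or more input entities. A minimal prompt-assembly sketch follows; the template, tag syntax, and helper name are assumptions for illustration, since the abstract does not specify the actual prompt format.

```python
def build_prompt(instruction: str, context: dict) -> str:
    """Assemble a hypothetical instruction-following prompt.

    `context` maps a domain label (e.g. 'protein') to its serialized
    sequence; the model would then generate the target entity as text.
    """
    parts = [f"Instruction: {instruction}"]
    for domain, seq in context.items():
        parts.append(f"<{domain}>{seq}</{domain}>")
    parts.append("Response:")
    return "\n".join(parts)

# Cross-domain design: ask for a binder molecule given a protein sequence.
prompt = build_prompt(
    "Generate a small molecule that binds the following protein.",
    {"protein": "MKTAYIAKQR"},
)
```

The same template covers text-guided optimization (instruction plus a molecule to improve) and cross-domain design (instruction plus a protein, generating RNA or a molecule), which is what makes a single generalist model sufficient for all three application families listed above.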
Problem

Research questions and friction points this paper is trying to address.

Scientific foundation models are trained in isolation per domain
No unified model supports cross-domain discovery tasks
Unclear whether performance scales with model size
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-domain sequence-based foundation model
Pre-trained with multi-scientific domain data
Unified model for diverse scientific applications
Authors

Yingce Xia · Unknown affiliation · Large Language Model, Machine Learning, Drug Discovery
Peiran Jin · Microsoft Research AI for Science
Shufang Xie · GSAI, Renmin University of China · Machine Learning
Liang He · Microsoft Research AI for Science
Chuan Cao · Microsoft AI4Science · Computational Biology, Virology, Genomics
Renqian Luo · Senior Researcher, Microsoft Research · Artificial Intelligence, Machine Learning, Deep Learning
Guoqing Liu · Microsoft Research AI for Science · Artificial Intelligence, Reinforcement Learning, Large Language Models, AI for Science
Yue Wang · Microsoft Research AI for Science
Zequn Liu · Microsoft Research AI4Science, Asia
Yuan-Jyue Chen · Microsoft · DNA nanotechnology, biotechnology, artificial intelligence
Zekun Guo · Lecturer in Electrical Engineering at Data Science, AI and Modelling Centre, University of Hull · Smart Grid, Reinforcement Learning, Model Predictive Control, LLM Agents
Yeqi Bai · Nanyang Technological University · machine learning
Pan Deng · Microsoft Research AI for Science
Yaosen Min · Zhongguancun Institute of Artificial Intelligence · Computational Biology, Bioinformatics, Deep Learning
Ziheng Lu · Microsoft Research AI for Science
Hongxia Hao · Microsoft Research AI for Science
Han Yang · Microsoft Research AI for Science
Jielan Li · Microsoft Research AI for Science
Chang Liu · Microsoft Research AI for Science
Jia Zhang · Microsoft Research AI for Science
Jianwei Zhu · Researcher, Microsoft Research Asia · Machine Learning, Computational Biology, Bioinformatics
Kehan Wu · Microsoft Research AI for Science
Wei Zhang · Microsoft Research AI for Science
Kaiyuan Gao · Huazhong University of Science and Technology · Visual Generation, AI4Science
Qizhi Pei · PhD Student, Gaoling School of Artificial Intelligence, Renmin University of China · LLM, Data Synthesis, AI4Science
Qian Wang · Microsoft Research AI for Science
Xixian Liu · Microsoft Research AI for Science
Yanting Li · Microsoft Research AI for Science
Houtian Zhu · Microsoft Research AI for Science
Yeqing Lu · Microsoft Research AI for Science
Mingqian Ma · Microsoft Research AI for Science
Zun Wang · Microsoft Research AI for Science
Tian Xie · Microsoft Research AI for Science
Krzysztof Maziarz · Microsoft Research AI for Science
Marwin H. S. Segler · Microsoft Research AI for Science
Zhao Yang · Microsoft Research AI for Science
Zilong Chen · Microsoft Research AI for Science
Yu Shi · Microsoft Research AI for Science
Shuxin Zheng · Deputy Director, Zhongguancun Institute of Artificial Intelligence · General AI, Generative AI
Lijun Wu · Shanghai AI Laboratory · ML, LLM, AI4Science
Chen Hu · School of Artificial Intelligence and Computer Science, Jiangnan University · Geometric Deep Learning, Machine Learning
Peggy Dai · Microsoft Research AI for Science
Tie-Yan Liu · President, Zhongguancun Academy | IEEE Fellow | ACM Fellow | AAIA Fellow · Machine learning, AI for Science, AI for Industry, Information retrieval, NLP
Haiguang Liu · Zhongguancun Academy · ai4science, biophysics, structure biology, x-ray laser, serial crystallography
Tao Qin · Vice President, Zhongguancun Academy · Deep Learning, AI4Science, Speech Synthesis, Neural Machine Translation, Information Retrieval