LipidBERT: A Lipid Language Model Pre-trained on METiS de novo Lipid Library

📅 2024-08-12
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Data scarcity and inadequate molecular representation hinder organ-targeting prediction for lipid nanoparticles (LNPs). Method: We construct a ten-million-scale virtual lipid library and propose LipidBERT, the first pre-trained language model designed specifically for lipid chemical structures. LipidBERT employs a novel "bilingual" modeling paradigm, jointly learning from lipid SMILES sequences ("dry-lab language") and LNP in vitro/in vivo performance metrics ("wet-lab language"), and integrates METiS's proprietary lipid generation algorithms, masked language modeling (MLM), and multi-task fine-tuning. Contribution/Results: LipidBERT achieves state-of-the-art performance on LNP property prediction tasks, with learned embeddings significantly outperforming those of PhatGPT, the authors' GPT-like lipid generation model. It represents the first validation of a lipid foundation model on real wet-lab downstream tasks. Deployed as the core AI filter within the METiS platform, LipidBERT enables high-throughput virtual screening of organ-targeting LNPs and evaluation of in vivo candidates.
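For readers unfamiliar with the MLM objective mentioned above, the sketch below shows BERT-style masking applied to a lipid SMILES string. This is a minimal illustration, not LipidBERT's actual pipeline: the character-level tokenization, the 15% masking rate, and the 80/10/10 mask/random/keep split follow the original BERT recipe, since the page does not specify LipidBERT's tokenizer or masking hyperparameters, and the example SMILES is a hypothetical ionizable-lipid-like fragment.

```python
import random

MASK = "[MASK]"

def tokenize(smiles: str) -> list[str]:
    # Naive character-level tokenization; a real SMILES tokenizer would
    # treat multi-character atoms (e.g. "Cl", "Br") as single tokens.
    return list(smiles)

def mask_tokens(tokens, mask_prob=0.15, vocab=None, seed=None):
    """Return (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere, mirroring the BERT MLM setup
    (80% -> [MASK], 10% -> random token, 10% -> unchanged)."""
    rng = random.Random(seed)
    vocab = vocab or sorted(set(tokens))
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK)
            elif r < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

# Hypothetical ionizable-lipid-like SMILES fragment, for illustration only:
smiles = "CCCCCCCCCC(=O)OCC(CN(C)C)OC(=O)CCCCCCCCC"
masked, labels = mask_tokens(tokenize(smiles), seed=0)
print("".join(masked))
```

During pre-training, the model is trained to predict the original tokens at the masked positions, which forces it to learn chemically meaningful context from the surrounding structure.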

📝 Abstract
In this study, we generate and maintain a database of 10 million virtual lipids through METiS's in-house de novo lipid generation algorithms and lipid virtual screening techniques. These virtual lipids serve as a corpus for pre-training, lipid representation learning, and downstream task knowledge transfer, culminating in state-of-the-art LNP property prediction performance. We propose LipidBERT, a BERT-like model pre-trained with the Masked Language Model (MLM) and various secondary tasks. Additionally, we compare the performance of embeddings generated by LipidBERT and PhatGPT, our GPT-like lipid generation model, on downstream tasks. The proposed bilingual LipidBERT model operates in two languages: the language of ionizable lipid pre-training, using in-house dry-lab lipid structures, and the language of LNP fine-tuning, utilizing in-house LNP wet-lab data. This dual capability positions LipidBERT as a key AI-based filter for future screening tasks, including new versions of METiS de novo lipid libraries and, more importantly, candidates for in vivo testing of organ-targeting LNPs. To the best of our knowledge, this is the first successful demonstration of the capability of a pre-trained language model on virtual lipids and its effectiveness in downstream tasks using wet-lab data. This work showcases the clever utilization of METiS's in-house de novo lipid library as well as the power of dry-wet lab integration.
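The abstract's encode-then-fine-tune pattern, where a pre-trained encoder's lipid embeddings feed a small head trained on wet-lab LNP data, can be sketched as below. Everything here is an illustrative assumption rather than LipidBERT's published architecture: the layer count, model width, mean pooling, and single regression head (standing in for, e.g., an organ-targeting score) are placeholders.

```python
import torch
import torch.nn as nn

class LipidPropertyRegressor(nn.Module):
    """Toy stand-in for the LipidBERT fine-tuning pattern: a Transformer
    encoder produces lipid embeddings; a small head regresses a wet-lab
    LNP property. All sizes and the pooling choice are assumptions."""

    def __init__(self, vocab_size=64, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # In practice this encoder would be initialized from MLM pre-training.
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # hypothetical LNP property score

    def forward(self, token_ids, pad_mask=None):
        h = self.encoder(self.embed(token_ids), src_key_padding_mask=pad_mask)
        pooled = h.mean(dim=1)  # mean-pool token embeddings into one vector
        return self.head(pooled).squeeze(-1)

model = LipidPropertyRegressor()
tokens = torch.randint(0, 64, (2, 48))  # toy batch of tokenized SMILES
print(model(tokens).shape)              # torch.Size([2])
```

The same pooled embedding could also feed multiple heads at once, which is one plausible reading of the multi-task fine-tuning the summary mentions.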
Problem

Research questions and friction points this paper is trying to address.

Pre-training a BERT-like model (LipidBERT) on virtual lipid data
Improving LNP property prediction using AI-based lipid representation
Integrating dry-lab and wet-lab data for lipid screening tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pre-trained BERT-like model for virtual lipids
Dual-language model for dry-wet lab integration
State-of-the-art LNP property prediction performance
Authors

Tianhao Yu, Cai Yao, Zhuorui Sun, Feng Shi, Lin Zhang, Kangjie Lyu, Xuan Bai, Andong Liu, Xicheng Zhang, Jiali Zou, Wenshou Wang, Chris Lai, Kai Wang (all METiS Pharmaceuticals)