SMI-Editor: Edit-based SMILES Language Model with Fragment-level Supervision

📅 2024-12-07
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing SMILES pretraining models rely solely on single-token supervision, neglecting substructural semantics, and are trained only on corrupted SMILES strings—leading to weak supervisory signals and train-inference mismatch. To address these limitations, we propose SMI-Editor, an edit-based pretraining paradigm that randomly perturbs molecular substructures (rather than individual atoms or bonds) and reconstructs the original valid SMILES, thereby enabling fragment-level supervision and joint modeling of chemical validity. Built upon a Transformer architecture, SMI-Editor explicitly incorporates SMILES syntactic constraints and chemical substructure priors. This work is the first to introduce edit operations into molecular language modeling. Evaluated across multiple downstream tasks, SMI-Editor achieves state-of-the-art performance—outperforming several 3D-aware representation models—and significantly enhances molecular semantic understanding and generation capabilities.

📝 Abstract
SMILES, a crucial textual representation of molecular structures, has garnered significant attention as a foundation for pre-trained language models (LMs). However, most existing pre-trained SMILES LMs focus solely on single-token-level supervision during pre-training, failing to fully leverage the substructural information of molecules. This limitation makes the pre-training task overly simplistic, preventing the models from capturing richer molecular semantic information. Moreover, during pre-training, these SMILES LMs only process corrupted SMILES inputs, never encountering any valid SMILES, which leads to a train-inference mismatch. To address these challenges, we propose SMI-Editor, a novel edit-based pre-trained SMILES LM. SMI-Editor disrupts substructures within a molecule at random and feeds the resulting SMILES back into the model, which then attempts to restore the original SMILES through an editing process. This approach not only introduces fragment-level training signals but also enables the use of valid SMILES as inputs, allowing the model to learn how to reconstruct complete molecules from incomplete structures. As a result, the model demonstrates improved scalability and an enhanced ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance across multiple downstream molecular tasks, even outperforming several 3D molecular representation models.
Problem

Research questions and friction points this paper is trying to address.

Existing SMILES LMs lack fragment-level molecular supervision.
Current models suffer from train-inference mismatch with invalid SMILES.
SMI-Editor improves molecular representation via edit-based fragment reconstruction.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Edit-based SMILES LM with fragment-level supervision
Uses valid SMILES inputs for training
Randomly disrupts substructures for reconstruction
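To make the corrupt-then-edit idea above concrete, here is a minimal, hypothetical sketch of the pretraining data flow: a contiguous token span is dropped from a tokenized SMILES (a crude stand-in for the paper's chemically-aware fragment removal), and token-level edit operations that restore the original string are derived as the supervision target. All function names are illustrative, not from the paper, and the edit-op derivation here uses a generic sequence alignment rather than the authors' actual training objective.

```python
import difflib
import random

def corrupt_smiles(tokens, rng, max_frag=4):
    """Drop one contiguous token span (a crude stand-in for a
    chemically-aware fragment) from a tokenized SMILES."""
    if len(tokens) < 2:
        return tokens[:]
    span = rng.randint(1, min(max_frag, len(tokens) - 1))
    start = rng.randrange(0, len(tokens) - span + 1)
    return tokens[:start] + tokens[start + span:]

def edit_labels(corrupted, original):
    """Derive token-level edit operations (keep / insert / delete)
    that restore the original SMILES from the corrupted one."""
    ops = []
    sm = difflib.SequenceMatcher(a=corrupted, b=original, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops += [("keep", t) for t in corrupted[i1:i2]]
        elif tag == "insert":
            ops += [("insert", t) for t in original[j1:j2]]
        elif tag == "delete":
            ops += [("delete", t) for t in corrupted[i1:i2]]
        else:  # "replace" decomposes into delete + insert
            ops += [("delete", t) for t in corrupted[i1:i2]]
            ops += [("insert", t) for t in original[j1:j2]]
    return ops

rng = random.Random(0)
original = list("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, char-level tokens
corrupted = corrupt_smiles(original, rng)
labels = edit_labels(corrupted, original)
# Applying the keep/insert ops in order reproduces the original SMILES.
restored = [t for op, t in labels if op != "delete"]
assert restored == original
```

Note the key property the Innovation bullets emphasize: the model's input (`corrupted`) is itself a plausible token sequence rather than a string littered with mask tokens, and the target is a set of fragment-level edits instead of independent single-token predictions.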
Kangjie Zheng
Wellcome Sanger Institute
AI4Science, NLP, Large Language Model
Siyue Liang
School of Computer Science, Peking University; National Key Laboratory for Multimedia Information Processing, Peking University; Peking University-Anker Embodied AI Lab, Peking University.
Junwei Yang
Peking University
Natural Language Processing, Graph Neural Network, AI4Science
Bin Feng
School of Computer Science, Peking University; National Key Laboratory for Multimedia Information Processing, Peking University; Peking University-Anker Embodied AI Lab, Peking University; International Digital Economy Academy (IDEA), Shenzhen, China.
Zequn Liu
Microsoft Research AI4Science, Asia
Wei Ju
College of Computer Science, Sichuan University, Chengdu, China.
Zhiping Xiao
Postdoc at University of Washington
CSEDMML
Ming Zhang
School of Computer Science, Peking University; National Key Laboratory for Multimedia Information Processing, Peking University; Peking University-Anker Embodied AI Lab, Peking University.