SMI-Editor: Edit-based SMILES Language Model with Fragment-level Supervision

📅 2024-12-07

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Existing SMILES pretraining models rely solely on single-token supervision, neglecting substructural semantics, and are trained only on corrupted SMILES strings—leading to weak supervisory signals and train-inference mismatch. To address these limitations, we propose SMI-Editor, an edit-based pretraining paradigm that randomly perturbs molecular substructures (rather than individual atoms or bonds) and reconstructs the original valid SMILES, thereby enabling fragment-level supervision and joint modeling of chemical validity. Built upon a Transformer architecture, SMI-Editor explicitly incorporates SMILES syntactic constraints and chemical substructure priors. This work is the first to introduce edit operations into molecular language modeling. Evaluated across multiple downstream tasks, SMI-Editor achieves state-of-the-art performance—outperforming several 3D-aware representation models—and significantly enhances molecular semantic understanding and generation capabilities.

📝 Abstract

SMILES, a crucial textual representation of molecular structures, has garnered significant attention as a foundation for pre-trained language models (LMs). However, most existing pre-trained SMILES LMs rely solely on single-token-level supervision during pre-training and fail to fully leverage the substructural information of molecules. This limitation makes the pre-training task overly simplistic, preventing the models from capturing richer molecular semantic information. Moreover, during pre-training, these SMILES LMs only process corrupted SMILES inputs and never encounter valid SMILES, which leads to a train-inference mismatch. To address these challenges, we propose SMI-Editor, a novel edit-based pre-trained SMILES LM. SMI-Editor disrupts substructures within a molecule at random and feeds the resulting SMILES back into the model, which then attempts to restore the original SMILES through an editing process. This approach not only introduces fragment-level training signals but also enables the use of valid SMILES as inputs, allowing the model to learn how to reconstruct complete molecules from these incomplete structures. As a result, the model demonstrates improved scalability and an enhanced ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance across multiple downstream molecular tasks and even outperforms several 3D molecular representation models.
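The corrupt-then-restore idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the regex tokenizer is simplified, the dropped span is chosen by hand rather than by a chemistry-aware fragmentation step, and Python's standard `difflib` stands in for whatever edit operations the model actually predicts. The function names `tokenize`, `drop_fragment`, and `edit_targets` are hypothetical.

```python
import difflib
import re

# Simplified SMILES tokenizer (assumption): bracket atoms, two-letter
# halogens, two-digit ring closures, then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)


def drop_fragment(tokens, start, end):
    """Remove tokens[start:end], standing in for deleting one substructure."""
    return tokens[:start] + tokens[end:]


def edit_targets(corrupted, original):
    """List the non-trivial edit operations that turn `corrupted` into `original`.

    Each entry is (operation, start, end, replacement_tokens), where start/end
    index into `corrupted`.
    """
    matcher = difflib.SequenceMatcher(a=corrupted, b=original, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, i2, original[j1:j2]))
    return ops


# Aspirin; the first six tokens "CC(=O)" form the acetyl group.
original = tokenize("CC(=O)Oc1ccccc1C(=O)O")

# Hand-picked span (assumption): dropping it leaves salicylic acid,
# which is itself a valid SMILES string.
corrupted = drop_fragment(original, 0, 6)

print("".join(corrupted))
print(edit_targets(corrupted, original))
```

In this toy case the model input is a shorter but still valid SMILES string, and the restoration target is a single multi-token insertion rather than one masked token, which is the sense in which the supervision is fragment-level.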

Problem

Research questions and friction points this paper is trying to address.

Existing SMILES LMs lack fragment-level molecular supervision.

Current models see only corrupted SMILES during pre-training, never valid ones, which causes a train-inference mismatch.

SMI-Editor improves molecular representation via edit-based fragment reconstruction.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Edit-based SMILES LM with fragment-level supervision

Uses valid SMILES inputs for training

Randomly disrupts substructures for reconstruction

🔎 Similar Papers

Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models