DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the elucidation of unknown small-molecule structures from mass spectra (MS), this paper proposes the first MS-conditioned graph diffusion framework for molecular generation. It introduces a formula-constrained encoder-decoder architecture: the encoder integrates MS domain knowledge, including peak formulae and neutral losses, while the decoder is a discrete graph diffusion model restricted to the heavy-atom composition of a known chemical formula. To mitigate the scarcity of structure-spectrum pairs, the diffusion decoder is pretrained at scale on fingerprint-structure pairs. The approach achieves state-of-the-art performance on established benchmarks for de novo molecule generation. Ablation studies support the contribution of both the diffusion decoder and the pretraining strategy, and performance improves consistently as the pretraining dataset grows. The code is publicly available.

📝 Abstract
Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional $\textit{de novo}$ generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on $\textit{de novo}$ molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. DiffMS code is publicly available at https://github.com/coleygroup/DiffMS.
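To make the formula restriction concrete, here is a minimal Python sketch of how a known chemical formula could fix the set of graph nodes before any bonds are generated. It is an illustration, not the DiffMS implementation: the function name `heavy_atom_nodes` is hypothetical, and the parser assumes a flat formula string with no parentheses, charges, or isotope labels.

```python
import re
from collections import Counter


def heavy_atom_nodes(formula: str) -> list[str]:
    """Expand a flat formula such as 'C2H6O' into one label per heavy atom.

    Hypothetical helper for illustration only. Assumes element symbols
    followed by optional integer counts, with no nesting or charges.
    """
    counts: Counter[str] = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    # Hydrogens are treated as implicit, so only heavy atoms become nodes.
    counts.pop("H", None)
    return [element for element in sorted(counts) for _ in range(counts[element])]


if __name__ == "__main__":
    # Ethanol: two carbons and one oxygen become the fixed node set.
    print(heavy_atom_nodes("C2H6O"))
```

With the node labels fixed in this way, a decoder of the kind described in the abstract only has to decide which bonds connect the nodes, which is one way to read "restricted by the heavy-atom composition of a known chemical formula".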
Problem

Research questions and friction points this paper is trying to address.

Generate molecular structures from mass spectra
Improve accuracy in small molecule discovery
Utilize transformer and diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based encoder for mass spectra
Graph diffusion model for molecular structures
Pretraining with fingerprint-structure pairs
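The abstract names peak formulae and neutral losses as the domain knowledge the encoder models. As a rough sketch of what a neutral loss is at the formula level, the snippet below subtracts a fragment peak's formula from the precursor's formula. The function name `neutral_loss` and the dictionary representation are assumptions made for illustration; they are not taken from the DiffMS code.

```python
def neutral_loss(precursor: dict[str, int], fragment: dict[str, int]) -> dict[str, int]:
    """Return the element counts lost when the precursor yields the fragment.

    Hypothetical helper for illustration only. Formulae are plain
    element-to-count dictionaries, e.g. {"C": 6, "H": 12, "O": 6}.
    """
    loss: dict[str, int] = {}
    for element in sorted(set(precursor) | set(fragment)):
        diff = precursor.get(element, 0) - fragment.get(element, 0)
        if diff < 0:
            # A fragment cannot contain more of an element than its precursor.
            raise ValueError(f"fragment has more {element} than the precursor")
        if diff:
            loss[element] = diff
    return loss


if __name__ == "__main__":
    glucose = {"C": 6, "H": 12, "O": 6}
    fragment = {"C": 6, "H": 10, "O": 5}
    # The difference corresponds to a loss of water (H2O).
    print(neutral_loss(glucose, fragment))
```

Each peak can then be described by two formulae, its own and the part lost from the precursor, which is the kind of per-peak information a spectrum encoder can consume.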
Montgomery Bohde
Massachusetts Institute of Technology, Cambridge, MA, United States
Mrunali Manjrekar
Massachusetts Institute of Technology, Cambridge, MA, United States
Runzhong Wang
Postdoc, MIT
combinatorial optimization, computational metabolomics, graph matching
Shuiwang Ji
Texas A&M University, College Station, TX, United States
Connor W. Coley
Massachusetts Institute of Technology
machine learning, drug discovery, automation, synthetic chemistry