FRIGID: Scaling Diffusion-Based Molecular Generation from Mass Spectra at Training and Inference Time

📅 2026-04-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses de novo molecular structure elucidation from mass spectra with FRIGID, a framework built on a diffusion language model. Trained on hundreds of millions of unlabeled molecular structures, FRIGID generates candidate structures conditioned on an input mass spectrum via an intermediate fingerprint representation and a determined chemical formula. At inference time, a forward fragmentation model identifies fragments that are inconsistent with the observed spectrum, and these are refined through targeted remasking and denoising. The approach therefore scales at both stages: large-scale unlabeled data during training, and iterative refinement at inference, where performance improves log-linearly with inference-time compute. FRIGID surpasses 18% Top-1 accuracy on the MassSpecGym benchmark and triples the Top-1 accuracy of the leading methods on NPLIB1.
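Read literally, log-linear scaling means that accuracy grows roughly linearly in the logarithm of compute. One hedged way to write this down is given here; the symbols A (Top-1 accuracy), C (inference-time compute), and the constants a and b are our own notation, not taken from the paper, and no fitted values are implied:

```latex
A(C) \approx a + b \log C,
\qquad
A(2C) - A(C) \approx b \log 2
```

Under this reading, each doubling of compute adds about the same fixed increment to accuracy, so gains continue but at a diminishing rate per unit of compute.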

📝 Abstract
In this work, we present FRIGID, a framework with a novel diffusion language model that generates molecular structures conditioned on mass spectra via intermediate fingerprint representations and determined chemical formulae, training at the scale of hundreds of millions of unlabeled structures. We then demonstrate how forward fragmentation models enable inference-time scaling by identifying spectrum-inconsistent fragments and refining them through targeted remasking and denoising. While FRIGID already achieves strong performance with its diffusion base, inference-time scaling significantly improves its accuracy, surpassing 18% Top-1 accuracy on the challenging MassSpecGym benchmark and tripling the Top-1 accuracy of the leading methods on NPLIB1. Further empirical analyses show that FRIGID exhibits log-linear performance scaling with increasing inference-time compute, opening a promising new direction for continued improvements in de novo structural elucidation. FRIGID code is publicly available at https://github.com/coleygroup/FRIGID
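The remask-and-denoise idea described in the abstract can be illustrated with a toy loop. This is a minimal sketch under loud assumptions, not the FRIGID implementation: `toy_denoise` is a hypothetical stand-in for the diffusion language model (it fills masked positions at random), and `toy_inconsistent_positions` is a hypothetical stand-in for the forward fragmentation model (it cheats by comparing against a known target string, whereas a real scorer would compare predicted fragments with the observed spectrum).

```python
import random

MASK = "_"
ALPHABET = "CNOS"  # toy token vocabulary, not a real molecular tokenizer


def toy_denoise(tokens, rng):
    """Hypothetical denoiser: fill each masked position with a random token."""
    return [rng.choice(ALPHABET) if t == MASK else t for t in tokens]


def toy_inconsistent_positions(tokens, target):
    """Hypothetical consistency check: flag positions that disagree with the
    evidence. Here the 'evidence' is a known target string (an assumption made
    only so the toy runs)."""
    return [i for i, (t, g) in enumerate(zip(tokens, target)) if t != g]


def refine(target, rounds, seed=0):
    """Generate a sequence, then repeatedly remask flagged positions and
    denoise again. Returns the final string and the per-round count of
    flagged positions."""
    rng = random.Random(seed)
    tokens = toy_denoise([MASK] * len(target), rng)  # initial generation
    history = []
    for _ in range(rounds):
        bad = toy_inconsistent_positions(tokens, target)
        history.append(len(bad))
        if not bad:
            break
        for i in bad:  # targeted remasking: only flagged positions
            tokens[i] = MASK
        tokens = toy_denoise(tokens, rng)  # denoise the remasked positions
    return "".join(tokens), history


if __name__ == "__main__":
    result, history = refine("CCNOCSCO", rounds=200)
    print(result, history)
```

Because positions judged consistent are never remasked, the count of flagged positions cannot increase from one round to the next. More rounds mean more chances to repair the remaining positions, which is the sense in which extra inference-time compute can buy accuracy in this toy.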
Problem

Research questions and friction points this paper is trying to address.

molecular generation
mass spectra
structural elucidation
de novo
diffusion model
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion language model
inference-time scaling
mass spectra
molecular generation
fragmentation modeling
Montgomery Bohde
Massachusetts Institute of Technology, Cambridge, MA, United States
Hongxuan Liu
PhD Student, MIT
Mathematical Optimization, AI for Science, High Performance Computing
Mrunali Manjrekar
Massachusetts Institute of Technology, Cambridge, MA, United States
Magdalena Lederbauer
Massachusetts Institute of Technology, Cambridge, MA, United States
Shuiwang Ji
Texas A&M University, College Station, TX, United States
Runzhong Wang
Postdoc, MIT
combinatorial optimization, computational metabolomics, graph matching
Connor W. Coley
Massachusetts Institute of Technology
machine learning, drug discovery, automation, synthetic chemistry