🤖 AI Summary
Structural annotation of MS/MS spectra remains challenging, particularly in complex biological samples, because spectra are difficult to map unambiguously to exact molecular structures. Method: This paper proposes MADGEN, a two-stage de novo molecular generation framework: (1) spectrum-driven molecular scaffold retrieval via contrastive learning; and (2) constrained molecule generation conditioned on the retrieved scaffold, with attention over the spectrum throughout the stepwise construction process. Contribution/Results: It introduces a spectrum-guided, scaffold-constrained generation paradigm in which spectral attention is applied throughout generation, narrowing the search space and improving generation accuracy. Evaluated on three datasets (NIST23, CANOPUS, and MassSpecGym) with both a predictive and an oracle scaffold retriever, the method achieves strong results with the oracle retriever, a step toward annotating molecules in the “dark chemical space.”
📝 Abstract
The annotation (assigning structural chemical identities) of MS/MS spectra remains a significant challenge due to the enormous molecular diversity in biological samples and the limited scope of reference databases. Currently, the vast majority of spectral measurements remain in the "dark chemical space" without structural annotations. To improve annotation, we propose MADGEN (Mass-spec Attends to De Novo Molecular GENeration), a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval, followed by spectra-conditioned molecular generation starting from the scaffold. In the first stage, given an MS/MS spectrum, we formulate scaffold retrieval as a ranking problem and employ contrastive learning to align mass spectra with candidate molecular scaffolds. In the second stage, starting from the retrieved scaffold, we use the MS/MS spectrum to guide an attention-based generative model that produces the final molecule. Our approach constrains the molecular generation search space, reducing its complexity and improving generation accuracy. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym), using both a predictive scaffold retriever and an oracle retriever. We demonstrate the effectiveness of using attention to integrate spectral information throughout the generation process, achieving strong results with the oracle retriever.
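The first stage, scaffold retrieval as a ranking problem trained with contrastive learning, can be illustrated with a minimal sketch. This is not MADGEN's implementation: the abstract does not specify the encoders, the similarity function, or the loss, so the cosine similarity, the InfoNCE-style loss, the temperature value, the toy embeddings, and the function names `rank_scaffolds` and `info_nce_loss` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_scaffolds(spectrum_emb, scaffold_embs):
    """Return candidate indices sorted by descending similarity to the spectrum."""
    scores = [cosine(spectrum_emb, s) for s in scaffold_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def info_nce_loss(spectrum_emb, scaffold_embs, positive_idx, temperature=0.1):
    """InfoNCE-style loss (assumed): negative log-softmax of the true scaffold's score."""
    logits = [cosine(spectrum_emb, s) / temperature for s in scaffold_embs]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[positive_idx]


# Toy embeddings (hypothetical): one spectrum and three candidate scaffolds.
spectrum = [0.9, 0.1, 0.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

ranking = rank_scaffolds(spectrum, candidates)
loss_true = info_nce_loss(spectrum, candidates, positive_idx=1)
loss_wrong = info_nce_loss(spectrum, candidates, positive_idx=2)
```

In this toy setup the candidate closest to the spectrum embedding ranks first, and the loss is lower when that candidate is the labeled positive. Minimizing such a loss pulls a spectrum toward its true scaffold and pushes it away from the other candidates. In a two-stage pipeline of this kind, the top-ranked scaffold would then seed the second-stage generator.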