MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Accurately identifying metabolite structures from MS/MS spectra remains a central challenge in metabolomics. This work proposes MSAlign, an approach that brings multimodal alignment, as used in vision-language models, to the spectrum–molecule matching task. It keeps two foundation models frozen (DreaMS for mass spectra and ChemBERTa for molecular representations) and maps their outputs into a shared embedding space through lightweight MLP projections trained with a candidate-based contrastive objective. The study also formalizes the trade-off between data leakage and distribution shift in dataset splitting schemes and introduces a quantitative measure of distribution shift to evaluate the splits used in existing benchmarks. MSAlign consistently outperforms existing methods on benchmarks including MassSpecGym and Spectraverse. To support reproducible research, the authors release all datasets, splits, candidate sets, and a unified implementation of MSAlign and the baselines.

📝 Abstract

Accurately identifying metabolites, i.e., small molecules, from mass spectrometry data remains a core challenge in metabolomics, with broad applications in drug discovery, environmental analysis, and clinical research. We address the Molecule Retrieval task, which consists of recovering the chemical structure of a metabolite from its MS/MS spectrum given a set of candidate molecules. While the recent release of benchmark datasets such as MassSpecGym and Spectraverse has considerably accelerated the development of novel machine learning approaches, the complexity of data preprocessing pipelines and the lack of unified implementations make methods and results difficult to reproduce and compare. We make three contributions. First, we propose a unified framework encompassing recent approaches based on representation alignment and contrastive learning. Second, we introduce MSAlign, inspired by multimodal alignment in vision-language models, which learns a shared representation space by aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective. MSAlign is simple to implement, fast to train, and consistently outperforms existing approaches across all benchmarks. Third, we investigate a long-standing evaluation problem: data splitting strategies in molecule retrieval implicitly trade off data leakage against domain shift. We formalize this tension by introducing a quantitative measure of distribution shift, and use it to evaluate splitting strategies in existing benchmarks. All datasets, splits, candidate sets, and a unified implementation of MSAlign and baselines are publicly released to support reproducible research.
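
To make the candidate-based contrastive objective more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact loss, so this assumes an InfoNCE-style softmax over cosine similarities between one projected spectrum embedding and the projected embeddings of its candidate molecules. The function names, the `temperature` value, and the toy vectors are all hypothetical; in the real system the embeddings would come from the frozen DreaMS and ChemBERTa encoders followed by the trained MLP projections.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_contrastive_loss(spec_emb, cand_embs, true_idx, temperature=0.1):
    """Cross-entropy over the candidate set (assumed InfoNCE-style form).

    spec_emb  : projected spectrum embedding (stand-in for DreaMS + MLP)
    cand_embs : projected candidate embeddings (stand-in for ChemBERTa + MLP)
    true_idx  : position of the correct molecule among the candidates
    """
    logits = [cosine(spec_emb, c) / temperature for c in cand_embs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[true_idx]


def rank_candidates(spec_emb, cand_embs):
    """Return candidate indices sorted from most to least similar."""
    sims = [cosine(spec_emb, c) for c in cand_embs]
    return sorted(range(len(cand_embs)), key=lambda i: sims[i], reverse=True)


# Toy example: three 2-D candidate embeddings, the first one is the true match.
spectrum = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
loss = candidate_contrastive_loss(spectrum, candidates, true_idx=0)
ranking = rank_candidates(spectrum, candidates)
```

At retrieval time only `rank_candidates` is needed: the candidate whose embedding lies closest to the spectrum embedding is returned first. In this sketch only the projections would receive gradients during training, which is consistent with the abstract's description of the method as fast to train.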

Problem

Research questions and friction points this paper is trying to address.

metabolite identification

mass spectrometry

molecule retrieval

MS/MS spectrum

chemical structure

Innovation

Methods, ideas, or system contributions that make the work stand out.

MSAlign

foundation models

contrastive learning