MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification

📅 2026-05-19
📈 Citations: 0
✹ Influential: 0

đŸ€– AI Summary
Accurately identifying metabolite structures from MS/MS spectra remains a central challenge in metabolomics. This work proposes MSAlign, which brings multimodal alignment, as used in vision-language models, to the spectrum–molecule matching task. It maps two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) into a shared embedding space through lightweight MLP projections trained with a candidate-based contrastive objective. The study also formalizes the trade-off between data leakage and distribution shift in dataset splitting schemes and introduces a quantitative measure of distribution shift to evaluate the splits used in existing benchmarks. MSAlign consistently outperforms existing methods on benchmarks including MassSpecGym and Spectraverse. To support reproducible research, the authors release all datasets, splits, candidate sets, and a unified codebase.
📝 Abstract
Accurately identifying metabolites (i.e., small molecules) from mass spectrometry data remains a core challenge in metabolomics, with broad applications in drug discovery, environmental analysis, and clinical research. We address the Molecule Retrieval task, which consists of recovering the chemical structure of a metabolite from its MS/MS spectrum given a set of candidate molecules. While the recent release of benchmark datasets such as MassSpecGym and Spectraverse has considerably accelerated the development of novel machine learning approaches, the complexity of data preprocessing pipelines and the lack of unified implementations make methods and results difficult to reproduce and compare. We make three contributions. First, we propose a unified framework encompassing recent approaches based on representation alignment and contrastive learning. Second, we introduce MSAlign, inspired by multimodal alignment in vision-language models, which learns a shared representation space by aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective. MSAlign is simple to implement, fast to train, and consistently outperforms existing approaches across all benchmarks. Third, we investigate a long-standing evaluation problem: data splitting strategies in molecule retrieval implicitly trade off data leakage against domain shift. We formalize this tension by introducing a quantitative measure of distribution shift, and use it to evaluate splitting strategies in existing benchmarks. All datasets, splits, candidate sets, and a unified implementation of MSAlign and baselines are publicly released to support reproducible research.
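The alignment idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' released implementation. It assumes the frozen encoders have already produced fixed-size embeddings (random arrays stand in for DreaMS and ChemBERTa outputs here), that each projection is a two-layer ReLU MLP, and that the candidate-based objective is a softmax cross-entropy over cosine similarities within each spectrum's own candidate set. All dimensions, the temperature, and the function names are hypothetical, and only the forward pass is shown (no gradient updates).

```python
import numpy as np

def init_mlp(rng, d_in, d_hidden, d_out):
    """Random weights for a two-layer MLP (hypothetical sizes)."""
    w1 = rng.normal(scale=d_in ** -0.5, size=(d_in, d_hidden))
    w2 = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d_out))
    return w1, np.zeros(d_hidden), w2, np.zeros(d_out)

def mlp_project(x, w1, b1, w2, b2, eps=1e-12):
    """Project frozen embeddings into the shared space and L2-normalise."""
    h = np.maximum(x @ w1 + b1, 0.0)  # ReLU hidden layer
    z = h @ w2 + b2
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

def candidate_contrastive_loss(spec_z, cand_z, target_idx, temperature=0.1):
    """Cross-entropy over each spectrum's candidate set.

    spec_z:     (B, D) projected spectrum embeddings, unit norm
    cand_z:     (B, K, D) projected candidate-molecule embeddings, unit norm
    target_idx: (B,) position of the true molecule among the K candidates
    """
    logits = np.einsum("bd,bkd->bk", spec_z, cand_z) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target_idx)), target_idx].mean()

def rank_candidates(spec_z, cand_z):
    """Candidate indices sorted by decreasing cosine similarity."""
    scores = np.einsum("bd,bkd->bk", spec_z, cand_z)
    return np.argsort(-scores, axis=1)

# Stand-ins for frozen encoder outputs; the sizes are placeholders,
# not the real DreaMS or ChemBERTa embedding dimensions.
rng = np.random.default_rng(0)
B, K = 4, 8
spec_dim, mol_dim, hidden, shared = 16, 12, 32, 6
spec_emb = rng.normal(size=(B, spec_dim))
cand_emb = rng.normal(size=(B, K, mol_dim))
target_idx = np.zeros(B, dtype=int)  # assume the true molecule sits at index 0

spec_z = mlp_project(spec_emb, *init_mlp(rng, spec_dim, hidden, shared))
cand_z = mlp_project(cand_emb, *init_mlp(rng, mol_dim, hidden, shared))

loss = candidate_contrastive_loss(spec_z, cand_z, target_idx)
ranking = rank_candidates(spec_z, cand_z)
print("loss:", float(loss))
print("top-1 candidate per spectrum:", ranking[:, 0])
```

In this sketch only the two small MLPs would receive gradient updates, which is consistent with the abstract's description of the method as fast to train. At inference time, retrieval reduces to ranking each spectrum's candidates by similarity in the shared space.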
Problem

Research questions and friction points this paper is trying to address.

metabolite identification
mass spectrometry
molecule retrieval
MS/MS spectrum
chemical structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

MSAlign
foundation models
contrastive learning
molecule retrieval
mass spectrometry
Authors
Paul Krzakala
LTCI, Télécom Paris & CMAP, École Polytechnique, Institut Polytechnique de Paris
Gabriel Melo
LTCI, Télécom Paris, Institut Polytechnique de Paris
Camille Lançon
CEA, INRAE, MetaboHUB, Université Paris-Saclay
Charlotte Laclau
LTCI, Télécom Paris, Institut Polytechnique de Paris
Rémi Flamary
CMAP, École Polytechnique, Institut Polytechnique de Paris
Machine Learning, Optimal Transport, Domain Adaptation, Graph processing, Signal Processing
Etienne Thévenot
CEA, INRAE, MetaboHUB, Université Paris-Saclay
Florence d'Alché-Buc
Télécom Paris, Institut Polytechnique de Paris
Statistical learning, structured and functional prediction, robustness, operator-valued kernels, Bioinformatics