AI Summary
Accurately identifying metabolite structures from MS/MS spectra remains a central challenge in metabolomics. This work proposes MSAlign, the first approach to introduce multimodal alignment into the spectrum–molecule matching task. It maps frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecular representations) into a shared embedding space via a lightweight MLP, trained efficiently with a candidate-based contrastive learning objective. The study formally analyzes the trade-off between data leakage and distribution shift in dataset splitting schemes and introduces quantitative metrics to characterize this balance. MSAlign substantially outperforms existing methods on benchmarks including MassSpecGym and Spectraverse. To foster reproducible research, the authors release all data, splitting strategies, candidate sets, and a unified codebase.
Abstract
Accurately identifying metabolites (i.e., small molecules) from mass spectrometry data remains a core challenge in metabolomics, with broad applications in drug discovery, environmental analysis, and clinical research. We address the Molecule Retrieval task, which consists of recovering the chemical structure of a metabolite from its MS/MS spectrum given a set of candidate molecules. While the recent release of benchmark datasets such as MassSpecGym and Spectraverse has considerably accelerated the development of novel machine learning approaches, the complexity of data preprocessing pipelines and the lack of unified implementations make methods and results difficult to reproduce and compare. We make three contributions. First, we propose a unified framework encompassing recent approaches based on representation alignment and contrastive learning. Second, we introduce MSAlign, inspired by multimodal alignment in vision-language models, which learns a shared representation space by aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective. MSAlign is simple to implement, fast to train, and consistently outperforms existing approaches across all benchmarks. Third, we investigate a long-standing evaluation problem: data splitting strategies in molecule retrieval implicitly trade off data leakage against domain shift. We formalize this tension by introducing a quantitative measure of distribution shift, and use it to evaluate splitting strategies in existing benchmarks. All datasets, splits, candidate sets, and a unified implementation of MSAlign and baselines are publicly released to support reproducible research.
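To make the candidate-based contrastive objective concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the spectrum and candidate-molecule embeddings have already been projected into the shared space, and it uses an InfoNCE-style softmax cross-entropy over cosine similarities, which the abstract does not specify. The function names (`cosine`, `candidate_contrastive_loss`, `rank_candidates`), the temperature value, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def candidate_contrastive_loss(spec_emb, cand_embs, true_idx, temperature=0.1):
    """Cross-entropy of the true molecule against its candidate set.

    Assumption: an InfoNCE-style loss; the temperature is illustrative.
    """
    logits = [cosine(spec_emb, c) / temperature for c in cand_embs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[true_idx]

def rank_candidates(spec_emb, cand_embs):
    """Candidate indices sorted from most to least similar to the spectrum."""
    sims = [cosine(spec_emb, c) for c in cand_embs]
    return sorted(range(len(cand_embs)), key=lambda i: sims[i], reverse=True)

# Toy embeddings (hypothetical): one spectrum and three candidate molecules.
spec = [1.0, 0.0]
cands = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

ranking = rank_candidates(spec, cands)
loss_if_first_is_true = candidate_contrastive_loss(spec, cands, true_idx=0)
loss_if_last_is_true = candidate_contrastive_loss(spec, cands, true_idx=2)
print(ranking, loss_if_first_is_true, loss_if_last_is_true)
```

In this sketch the loss is small when the true molecule is the candidate closest to the spectrum and large when it is the farthest, which is the signal that would train the projection layers while the two foundation models stay frozen. At inference time, retrieval reduces to the ranking step.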