🤖 AI Summary
Interpreting molecular structures from tandem mass spectrometry (MS/MS) spectra remains challenging, and the absence of a standardized, comprehensive benchmark hinders systematic development of AI-based methods. To address this, the authors introduce MassSpecGym, the first integrated, open-source benchmark for MS/MS spectral interpretation. It formally defines three core tasks: de novo molecular structure generation, molecule retrieval, and spectrum simulation. The benchmark uses a skeleton-aware data splitting strategy to assess generalization and defines task-specific evaluation metrics. MassSpecGym integrates over 100,000 high-quality, experimentally acquired MS/MS spectra, each annotated with exact masses, fragment ions, and ground-truth molecular structures, spanning diverse instrumentation platforms and chemical space. Presented as the largest publicly available collection of high-quality labeled MS/MS spectra, it is released with implementation code and reproducible protocols. By standardizing datasets and evaluation, MassSpecGym lowers the barrier to validating new algorithms and supports research in deep generative modeling, similarity-based retrieval, and spectrum representation learning, with the aim of accelerating AI-driven structural elucidation of unknown molecules.
📝 Abstract
The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.