To Bin or not to Bin: Alternative Representations of Mass Spectra

📅 2025-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the information loss caused by binning mass spectrometry (MS) data before machine learning, which distorts peak positions and intensities and imposes an artificial resolution limit. To avoid this, the authors propose two binning-free representations: (1) treating a spectrum as an unordered set of peaks encoded by a Set Transformer; and (2) building a graph whose nodes are peaks and whose edges encode pairwise relationships based on *m/z* proximity and intensity correlation, processed by a graph neural network (GNN). Both approaches work directly on raw peak positions and intensities and on the structural relationships between peaks, so no binning-induced distortion is introduced. In molecular property regression and similarity learning tasks, both methods significantly outperform conventional binned-spectrum baselines with MLPs (*p* < 0.01). The work is presented as the first systematic formulation and empirical validation of set- and graph-based representations for mass spectra, and it argues that preserving native peak structure is critical for accurate molecular representation.
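The two binning-free representations described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the zero-padding with a mask, and the edge rule are assumptions made for illustration. The edge rule uses only an m/z gap threshold and leaves out the intensity-based criterion mentioned in the summary. The toy peak values are made up.

```python
def spectrum_to_set(peaks, max_peaks):
    """Return (features, mask) for a set model.

    Each peak keeps its exact (m/z, intensity) pair. Padding to max_peaks
    with a 0/1 mask is an assumed batching convention, not taken from the paper.
    """
    features = [[mz, intensity] for mz, intensity in peaks[:max_peaks]]
    mask = [1] * len(features)
    while len(features) < max_peaks:
        features.append([0.0, 0.0])  # padding row, ignored via mask
        mask.append(0)
    return features, mask


def spectrum_to_graph(peaks, max_mz_gap):
    """Return (nodes, edges) for a graph model.

    Nodes are peaks. An undirected edge (i, j) links two peaks whose m/z
    values differ by at most max_mz_gap (hypothetical, simplified rule).
    """
    nodes = [[mz, intensity] for mz, intensity in peaks]
    edges = []
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            if abs(peaks[i][0] - peaks[j][0]) <= max_mz_gap:
                edges.append((i, j))
    return nodes, edges


# Toy spectrum: (m/z, relative intensity) pairs; values are illustrative only.
toy_peaks = [(100.02, 0.4), (100.91, 0.6), (250.50, 1.0)]

set_features, set_mask = spectrum_to_set(toy_peaks, max_peaks=5)
graph_nodes, graph_edges = spectrum_to_graph(toy_peaks, max_mz_gap=1.0)
```

In both outputs every peak retains its measured m/z value, so no resolution is fixed in advance. A set model would consume `set_features` together with `set_mask`, while a GNN would consume `graph_nodes` and `graph_edges`.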

📝 Abstract
Mass spectrometry, especially tandem mass spectrometry, is commonly used to assess the chemical diversity of samples. The resulting mass fragmentation spectra represent molecules whose structures may not have been determined. This poses the challenge of experimentally determining or computationally predicting molecular structures from mass spectra. An alternative is to predict molecular properties or molecular similarity directly from spectra. Various methodologies have been proposed to embed mass spectra for use in machine learning tasks. However, these methodologies require preprocessing of the spectra, which often includes binning or sub-sampling peaks, mainly to create vectors of uniform size and to remove noise. Here, we investigate two alternatives to binning mass spectra before downstream machine learning tasks: set-based and graph-based representations. Using the two proposed representations to train a set transformer and a graph neural network, respectively, on a regression task, we show that both perform substantially better than a multilayer perceptron trained on binned data.
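To make the binning step concrete, here is a minimal sketch of fixed-width binning that yields a uniform-length vector. The m/z range, the bin width, and the choice to sum intensities within a bin are assumptions for illustration; the abstract does not state the exact scheme used for the baseline. The toy peaks are made up.

```python
import math


def bin_spectrum(peaks, mz_min=0.0, mz_max=1000.0, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    All defaults are hypothetical. Peaks outside [mz_min, mz_max) are dropped.
    """
    n_bins = math.ceil((mz_max - mz_min) / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            index = int((mz - mz_min) // bin_width)
            index = min(index, n_bins - 1)  # guard against float edge cases
            vector[index] += intensity
    return vector


# Toy spectrum: (m/z, relative intensity) pairs; values are illustrative only.
example_peaks = [(100.02, 0.4), (100.91, 0.6), (250.50, 1.0)]

binned = bin_spectrum(example_peaks)
occupied_bins = [i for i, value in enumerate(binned) if value > 0.0]
```

With a bin width of 1.0, the two peaks near m/z 100 fall into the same bin, so the vector records one merged intensity where the spectrum had two distinct peaks. This is the kind of information loss that motivates the set-based and graph-based alternatives.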
Problem

Research questions and friction points this paper is trying to address.

Alternative representations of mass spectra
Predict molecular properties from spectra
Improve machine learning model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Set-based mass spectra representation
Graph-based mass spectra representation
Comparison with binned data training
Authors
Niek de Jonge
Bioinformatics Group, Wageningen University & Research, The Netherlands
Justin J. J. van der Hooft
Bioinformatics Group, Wageningen University & Research, The Netherlands
Daniel Probst
WUR
cheminformatics, chemistry, medical chemistry, bioinformatics, computer science