🤖 AI Summary
This study addresses the poorly understood effect of loss function choice on retrieval performance in small-molecule identification from liquid chromatography–tandem mass spectrometry (LC-MS/MS) data. Combining theoretical analysis with empirical validation, the work derives regret bounds for commonly used loss functions and reveals a fundamental trade-off between molecular fingerprint prediction accuracy and retrieval performance, a trade-off that depends on the similarity structure of the candidate molecule set. Notably, optimizing fingerprint similarity does not necessarily improve retrieval and may even degrade it. These findings provide a theoretical basis for selecting loss functions and fingerprint representations in small-molecule retrieval systems.
📝 Abstract
One of the central challenges in the computational analysis of liquid chromatography-tandem mass spectrometry (LC-MS/MS) data is to identify the compounds underlying the output spectra. In recent years, this problem has increasingly been tackled using deep learning methods. A common strategy involves predicting a molecular fingerprint vector from an input mass spectrum, which is then used to search for matches in a chemical compound database. While various loss functions are employed in training these predictive models, their impact on model performance remains poorly understood. In this study, we investigate commonly used loss functions, deriving novel regret bounds that characterize when Bayes-optimal decisions for these objectives must diverge. Our results reveal a fundamental trade-off between the two objectives of (1) fingerprint similarity and (2) molecular retrieval. Optimizing for more accurate fingerprint predictions typically worsens retrieval results, and vice versa. Our theoretical analysis shows this trade-off depends on the similarity structure of candidate sets, providing guidance for loss function and fingerprint selection.
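The divergence between the two Bayes-optimal decisions can be illustrated with a toy example. The sketch below is not taken from the paper: the three candidates, their 6-bit fingerprints, the posterior probabilities, and the choice of Hamming distance as the fingerprint loss are all assumptions made purely for illustration. It compares the retrieval-optimal decision (the most probable candidate) with the fingerprint-optimal decision (each bit set by thresholding its posterior marginal at 0.5) and then looks up the nearest candidate to that predicted fingerprint.

```python
# Toy illustration only: candidates, fingerprints, and posterior are invented,
# and Hamming distance stands in for whichever fingerprint loss is used.
candidates = {
    "A": [1, 1, 0, 0, 0, 0],
    "B": [0, 0, 1, 1, 1, 0],  # B and C are similar to each other
    "C": [0, 0, 1, 1, 0, 1],
}
# Hypothetical posterior over which candidate produced the observed spectrum.
posterior = {"A": 0.4, "B": 0.3, "C": 0.3}


def hamming(u, v):
    """Number of positions at which two binary fingerprints differ."""
    return sum(a != b for a, b in zip(u, v))


def expected_hamming(pred):
    """Expected Hamming loss of a predicted fingerprint under the posterior."""
    return sum(p * hamming(pred, candidates[name]) for name, p in posterior.items())


def retrieve(pred):
    """Return the candidate whose fingerprint is closest to the prediction."""
    return min(candidates, key=lambda name: hamming(pred, candidates[name]))


# Bayes-optimal decision for retrieval (0-1 loss): the posterior mode.
retrieval_choice = max(posterior, key=posterior.get)

# Bayes-optimal decision for bitwise Hamming loss: threshold each bit's
# posterior marginal at 0.5.
n_bits = len(candidates["A"])
marginals = [
    sum(posterior[name] * candidates[name][i] for name in candidates)
    for i in range(n_bits)
]
fingerprint_choice = [1 if m > 0.5 else 0 for m in marginals]

# Database lookup with the fingerprint-optimal prediction.
retrieved_from_fingerprint = retrieve(fingerprint_choice)

print("retrieval-optimal candidate:", retrieval_choice)
print("fingerprint-optimal prediction:", fingerprint_choice)
print("candidate retrieved from that prediction:", retrieved_from_fingerprint)
print("expected Hamming loss, fingerprint-optimal:", round(expected_hamming(fingerprint_choice), 3))
print("expected Hamming loss, predicting A exactly:", round(expected_hamming(candidates["A"]), 3))
```

In this constructed case the fingerprint-optimal prediction has the lower expected Hamming loss (about 2.2 versus 3.0 for predicting A's fingerprint exactly), yet looking it up returns one of the two similar candidates B or C, each correct with probability 0.3, rather than A, which is correct with probability 0.4. The effect arises only because B and C share most of their bits, which is a small instance of the dependence on candidate-set similarity structure described above.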