CoRe-Gen: Robust Spectrum-to-Structure Generation under Imperfect Fingerprint Conditions

📅 2026-05-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the condition mismatch problem in generating molecular structures from tandem mass spectra: the decoder is trained on clean fingerprints but must operate on noisy predicted fingerprints at inference, which particularly degrades performance on long-tail substructures. To mitigate this, the authors propose CoRe-Gen, a robust generative framework that pretrains the spectrum encoder on synthetic mass spectra, trains the decoder with frequency-aware fingerprint corruption, and combines structure-aware autoregressive decoding with lightweight chemical constraints. The approach explicitly aligns the fingerprint conditions seen during training with those encountered at inference. The method achieves state-of-the-art results on NPLIB1 with 19.54% Top-1 and 29.92% Top-10 exact-match accuracy, while remaining competitive on MassSpecGym.
📝 Abstract
Molecular structure elucidation from tandem mass spectra (MS/MS) remains challenging, particularly for de novo generation beyond database coverage. A common approach decomposes the task into spectrum-to-fingerprint prediction followed by fingerprint-to-structure decoding, enabling the use of large-scale molecular corpora. However, at deployment, the decoder relies on predicted rather than oracle fingerprints, introducing structured errors that propagate into generation. This results in a fundamental condition mismatch, where models trained on clean inputs must operate under noisy, biased predictions, especially for long-tail substructures. We present CoRe-Gen, a framework that explicitly addresses this gap. CoRe-Gen improves the intermediate condition via synthetic-spectrum pretraining of the encoder, matches deployment-time noise through frequency-aware fingerprint corruption during decoder training, and mitigates residual errors using structure-aware autoregressive decoding with compositional SELFIES representations, auxiliary structural supervision, and lightweight chemical constraints. Experiments on standard benchmarks show that CoRe-Gen establishes a new state of the art on NPLIB1, achieving 19.54% Top-1 and 29.92% Top-10 exact-match accuracy, while remaining competitive on the more challenging MassSpecGym benchmark. Importantly, CoRe-Gen preserves the efficiency advantages of autoregressive decoding, providing a practical and scalable solution for robust spectrum-to-structure generation under realistic conditions.
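The abstract does not specify how the frequency-aware corruption is parameterized, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes binary fingerprints, that rare bits are dropped more often (mimicking a predictor that misses long-tail substructures), and that frequent bits are spuriously added more often. The function names and the `max_drop` and `max_add` parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def bit_frequencies(fingerprints: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of training molecules in which each fingerprint bit is set."""
    n = len(fingerprints)
    width = len(fingerprints[0])
    return [sum(fp[j] for fp in fingerprints) / n for j in range(width)]


def corrupt_fingerprint(
    fp: Sequence[int],
    freqs: Sequence[float],
    max_drop: float = 0.3,   # hypothetical cap on the 1 -> 0 flip probability
    max_add: float = 0.05,   # hypothetical cap on the 0 -> 1 flip probability
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return a noisy copy of a clean (oracle) fingerprint.

    Assumption (not from the source): the flip probability of each bit
    depends linearly on how often that bit is set in the training corpus.
    """
    rng = rng or random.Random()
    noisy = []
    for bit, f in zip(fp, freqs):
        if bit == 1:
            # Rare substructures are assumed to be missed more often.
            p_flip = max_drop * (1.0 - f)
        else:
            # Common substructures are assumed to be hallucinated more often.
            p_flip = max_add * f
        noisy.append(1 - bit if rng.random() < p_flip else bit)
    return noisy
```

Under these assumptions, a decoder training loop would call `corrupt_fingerprint` on each oracle fingerprint before conditioning on it, so that the decoder sees inputs closer to what a spectrum-to-fingerprint predictor produces at inference.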
Problem

Research questions and friction points this paper is trying to address.

spectrum-to-structure generation
molecular fingerprint prediction
condition mismatch
de novo structure elucidation
tandem mass spectrometry
Innovation

Methods, ideas, or system contributions that make the work stand out.

spectrum-to-structure generation
fingerprint corruption
SELFIES representation
condition mismatch
autoregressive decoding