Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

📅 2025-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the pervasive problem of missing fragment ions in mass spectrometry-based de novo peptide sequencing, this paper proposes a "complete-then-decode" paradigm that operates in latent space. Unlike conventional signal-domain imputation, the method introduces a learnable peak-query mechanism guided by theoretical fragment spectra and formulates missing-fragment recovery as a set-prediction problem solved via optimal bipartite matching, which avoids the spectral distortion that comes with interpolating raw spectra. An autoregressive decoder then infers the peptide sequence end to end. On three standard benchmarks, the approach outperforms existing state-of-the-art methods in sequencing accuracy. The implementation is publicly available.

📝 Abstract

De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models usually encode the observed mass spectra into latent representations from which peptides are predicted autoregressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called Latent Imputation before Prediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.
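To make "theoretical peak profile" and "missing fragmentation" concrete, the sketch below computes singly charged b- and y-ion m/z values for a peptide and lists the ions that have no matching peak in an observed spectrum. It is an illustration only, not code from the LIPNovo repository: the function names `theoretical_peaks` and `missing_ions`, the 0.02 Da tolerance, and the restriction to unmodified residues and charge 1 are all assumptions made here.

```python
# Monoisotopic residue masses in daltons (standard values, unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton, Da
WATER = 18.010565   # monoisotopic mass of H2O, Da


def theoretical_peaks(peptide):
    """Return singly charged b- and y-ion m/z values keyed by ion name."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = {}
    for i in range(1, len(peptide)):
        # b_i: first i residues plus one proton.
        peaks[f"b{i}"] = sum(masses[:i]) + PROTON
        # y_i: last i residues plus water plus one proton.
        peaks[f"y{i}"] = sum(masses[-i:]) + WATER + PROTON
    return peaks


def missing_ions(peptide, observed_mz, tol=0.02):
    """List theoretical ions with no observed peak within `tol` Da (hypothetical helper)."""
    return [
        name
        for name, mz in theoretical_peaks(peptide).items()
        if not any(abs(mz - obs) <= tol for obs in observed_mz)
    ]


if __name__ == "__main__":
    # Toy spectrum for the tripeptide GAS in which only b1 and y1 were observed.
    print(missing_ions("GAS", [58.0287, 106.0499]))
```

In this toy case the b2 and y2 ions are the "missing fragmentation" that, according to the abstract, LIPNovo compensates for in latent space rather than by regenerating raw peaks.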

Problem

Research questions and friction points this paper is trying to address.

Addresses missing fragmentation in peptide sequencing

Proposes latent imputation before peptide prediction

Improves performance without generating raw missing spectrum data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent imputation compensates for missing fragmentation information

Imputation framed as a set-prediction problem with learnable peak queries

Latent representations of theoretical peaks generated through optimal bipartite matching
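The matching step in a set-prediction setup can be sketched as follows: each theoretical-peak target is paired with exactly one query output so that the total pairing cost is minimal. The summary does not specify the cost function or the solver, so this sketch assumes a squared Euclidean cost between embeddings and uses a brute-force search over assignments, which is only practical at toy sizes; a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice at realistic sizes. The names `squared_distance` and `match_queries_to_targets` are hypothetical.

```python
from itertools import permutations


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors (assumed cost)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def match_queries_to_targets(query_outputs, target_embeddings):
    """Return (assignment, cost) where assignment[t] is the query index paired with target t.

    Brute-force optimal bipartite matching: every target gets a distinct query,
    and surplus queries stay unmatched.
    """
    n_queries = len(query_outputs)
    n_targets = len(target_embeddings)
    if n_targets > n_queries:
        raise ValueError("need at least as many queries as targets")

    best_cost, best_assignment = float("inf"), None
    # Each candidate is an ordered choice of distinct query indices, one per target.
    for candidate in permutations(range(n_queries), n_targets):
        cost = sum(
            squared_distance(query_outputs[q], target_embeddings[t])
            for t, q in enumerate(candidate)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, candidate
    return list(best_assignment), best_cost


if __name__ == "__main__":
    # Three query outputs, two theoretical-peak targets (toy 2-D embeddings).
    queries = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    targets = [[0.9, 1.1], [0.1, -0.1]]
    print(match_queries_to_targets(queries, targets))
```

Because the assignment is one-to-one, no two queries are trained toward the same theoretical peak, which is what lets an unordered set of queries be supervised against an unordered set of targets.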
