Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

📅 2025-05-23
🤖 AI Summary
To address the pervasive problem of missing fragment ions in mass spectrometry-based de novo peptide sequencing, this paper proposes a “complete-then-decode” paradigm that operates in latent space. Unlike conventional signal-domain imputation, the method introduces learnable peak queries guided by theoretical fragment spectra and formulates missing-fragment recovery as a set prediction problem solved via optimal bipartite matching, thereby avoiding the spectral distortion that raw-spectrum interpolation can introduce. An autoregressive decoder then predicts the peptide sequence end to end. On three benchmark datasets, the approach outperforms existing state-of-the-art methods by large margins. The implementation is publicly available.

📝 Abstract
De novo peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models usually encode the observed mass spectra into latent representations from which peptides are predicted autoregressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called Latent Imputation before Prediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.
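As a rough illustration of what a "theoretical peak profile" can look like, the sketch below computes singly charged b- and y-ion m/z values for an unmodified peptide from standard monoisotopic residue masses. It is not taken from the LIPNovo code: the function name `theoretical_by_ions`, the restriction to charge 1, and the omission of modifications and neutral losses are all assumptions made for brevity.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)


def theoretical_by_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    Hypothetical helper: b_ions[i] is the b-ion of the first i+1 residues,
    y_ions[i] is the y-ion of the last i+1 residues.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]

    b_ions, prefix = [], 0.0
    for m in masses[:-1]:            # prefixes of length 1 .. n-1
        prefix += m
        b_ions.append(prefix + PROTON)

    y_ions, suffix = [], 0.0
    for m in reversed(masses[1:]):   # suffixes of length 1 .. n-1
        suffix += m
        y_ions.append(suffix + WATER + PROTON)

    return b_ions, y_ions


b_ions, y_ions = theoretical_by_ions("PEPTIDE")
```

A peptide of n residues yields n-1 b-ions and n-1 y-ions under these assumptions; any of them that fail to appear in the observed spectrum are the kind of missing fragmentation the abstract refers to.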
Problem

Research questions and friction points this paper is trying to address.

Addresses missing fragmentation in peptide sequencing
Proposes latent imputation before peptide prediction
Enhances performance without raw data generation
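To make the notion of missing fragmentation concrete, the following minimal sketch lists which theoretical fragment m/z values have no observed peak within a mass tolerance. The function name `missing_theoretical_peaks`, the example values, and the 0.02 Da tolerance are illustrative assumptions, not settings reported for LIPNovo.

```python
from bisect import bisect_left


def missing_theoretical_peaks(theoretical_mz, observed_mz, tol_da=0.02):
    """Return the theoretical m/z values with no observed peak within tol_da."""
    observed = sorted(observed_mz)
    missing = []
    for mz in theoretical_mz:
        # Index of the first observed peak at or above the lower tolerance bound.
        i = bisect_left(observed, mz - tol_da)
        # Missing if no such peak exists or it lies above the upper bound.
        if i == len(observed) or observed[i] > mz + tol_da:
            missing.append(mz)
    return missing


theoretical = [58.0287, 90.0550, 147.1128]
observed = [58.03, 120.0, 147.10]
missing = missing_theoretical_peaks(theoretical, observed)
```

In this toy spectrum only the fragment at 90.0550 has no observed counterpart, so it alone is reported as missing.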
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent imputation compensates for missing fragmentation information
Set-prediction problem with learnable peak queries
Optimal bipartite matching generates latent representations
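The set-prediction idea in the points above can be sketched as a one-to-one assignment between query outputs and target peak representations that minimizes total cost. The example below is a toy, dependency-free version: it uses squared Euclidean distance as the cost and an exhaustive search over assignments, both of which are assumptions chosen for clarity rather than the paper's actual cost function or solver. In practice a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would normally replace the exhaustive search.

```python
from itertools import permutations


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def match_queries_to_targets(query_outputs, targets):
    """Find the minimum-cost one-to-one assignment of queries to targets.

    Returns (assignment, total_cost), where assignment[j] is the index of
    the query matched to target j. Exhaustive search, so only suitable for
    very small toy inputs.
    """
    n_q, n_t = len(query_outputs), len(targets)
    if n_q < n_t:
        raise ValueError("need at least as many queries as targets")

    # cost[q][t]: cost of pairing query q with target t.
    cost = [[squared_distance(q, t) for t in targets] for q in query_outputs]

    best_cost, best_perm = float("inf"), None
    # Each permutation picks a distinct query for every target.
    for perm in permutations(range(n_q), n_t):
        total = sum(cost[q][j] for j, q in enumerate(perm))
        if total < best_cost:
            best_cost, best_perm = total, perm
    return list(best_perm), best_cost


# Three hypothetical query outputs and two target peak embeddings (2-D toy vectors).
queries = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
targets = [[0.0, 0.9], [1.0, 0.0]]
assignment, total = match_queries_to_targets(queries, targets)
```

Here target 0 is matched to query 1 and target 1 to query 0, leaving query 2 unmatched. In set-prediction training, a loss is then typically computed only over the matched pairs, so the order of the queries does not matter.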
Ye Du
Department of Biomedical Engineering, The Hong Kong Polytechnic University
Chen Yang
Department of Biomedical Engineering, The Hong Kong Polytechnic University
Nanxi Yu
Department of Biomedical Engineering, The Hong Kong Polytechnic University
Wanyu Lin
The Hong Kong Polytechnic University
Graph Learning, AI for Chemistry, AI for Materials Science, Collaborative Learning
Qian Zhao
Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University
Shujun Wang
The Hong Kong Polytechnic University
AI for Healthcare, Smart Ageing, AI for Science