pUniFind: a unified large pre-trained deep learning model pushing the limit of mass spectra interpretation

📅 2025-06-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Most deep learning models for mass spectrometry data interpretation serve as feature extractors rather than unified scoring frameworks, which limits zero-shot de novo sequencing, large-scale post-translational modification (PTM) support, and peptide identification sensitivity. This work introduces pUniFind, described as the first large-scale multimodal pre-trained model in proteomics, which integrates end-to-end peptide-spectrum scoring with open, zero-shot de novo sequencing by aligning the spectral and peptide modalities through cross-modality prediction. The model supports over 1,300 modifications and identifies 60% more peptide-spectrum matches (PSMs) than existing de novo methods despite a 300-fold larger search space. It also yields a 42.6% increase in identified peptides in immunopeptidomics, and a deep learning-based quality control module recovers 38.5% additional peptides, including 1,891 that map to the genome but are absent from reference proteomes.

📝 Abstract

Deep learning has advanced mass spectrometry data interpretation, yet most models remain feature extractors rather than unified scoring frameworks. We present pUniFind, the first large-scale multimodal pre-trained model in proteomics that integrates end-to-end peptide-spectrum scoring with open, zero-shot de novo sequencing. Trained on over 100 million spectra derived from open searches, pUniFind aligns the spectral and peptide modalities via cross-modality prediction and outperforms traditional engines across diverse datasets, most notably achieving a 42.6 percent increase in the number of identified peptides in immunopeptidomics. Supporting over 1,300 modifications, pUniFind identifies 60 percent more PSMs than existing de novo methods despite a 300-fold larger search space. A deep learning-based quality control module further recovers 38.5 percent additional peptides, including 1,891 mapped to the genome but absent from reference proteomes, while preserving full fragment ion coverage. These results establish a unified, scalable deep learning framework for proteomic analysis, offering improved sensitivity, modification coverage, and interpretability.
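The abstract mentions "full fragment ion coverage" without defining it. As a rough illustration only, the sketch below uses one common definition: the fraction of backbone cleavage sites supported by at least one matched singly charged b or y ion. This is not pUniFind's scoring or quality control code. The function names `fragment_mz` and `site_coverage`, the 0.02 m/z tolerance, and the restriction to unmodified peptides and charge 1+ fragments are all assumptions made for this example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)


def fragment_mz(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[k-1] is the b_k ion (first k residues); y[k-1] is the y_k ion
    (last k residues). Unmodified residues only (illustrative assumption).
    """
    masses = [MONO[aa] for aa in peptide]
    b, y = [], []
    prefix = 0.0
    for m in masses[:-1]:            # b_1 .. b_(n-1)
        prefix += m
        b.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # y_1 .. y_(n-1)
        suffix += m
        y.append(suffix + WATER + PROTON)
    return b, y


def site_coverage(peptide, peaks_mz, tol=0.02):
    """Fraction of cleavage sites with a matched b or complementary y ion."""
    b, y = fragment_mz(peptide)
    n = len(peptide)

    def matched(mz):
        return any(abs(mz - p) <= tol for p in peaks_mz)

    covered = 0
    for i in range(1, n):            # cleavage after residue i
        # b_i and y_(n-i) arise from the same backbone cleavage.
        if matched(b[i - 1]) or matched(y[n - i - 1]):
            covered += 1
    return covered / (n - 1)


if __name__ == "__main__":
    b_ions, y_ions = fragment_mz("PEPTIDE")
    # A spectrum containing every b ion covers every cleavage site.
    print(site_coverage("PEPTIDE", b_ions))
```

Under this definition, a peptide has full coverage when the returned value is 1.0, meaning every cleavage site has at least one supporting peak. A real pipeline would also handle modifications, higher fragment charge states, and ppm-based tolerances.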

Problem

Research questions and friction points this paper is trying to address.

Develop a unified scoring framework for mass spectra interpretation

Improve peptide identification sensitivity in proteomics data

Support large-scale modification detection in spectral analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large multimodal pre-trained model for proteomics

Cross-modality prediction to align spectral and peptide modalities

Deep learning-based quality control module for recovering additional peptides

🔎 Similar Papers

FraGNNet: A Deep Probabilistic Model for Mass Spectrum Prediction