pUniFind: a unified large pre-trained deep learning model pushing the limit of mass spectra interpretation

📅 2025-06-30
🤖 AI Summary
Most existing deep learning models for mass spectrometry interpretation serve as feature extractors rather than unified scoring frameworks, which limits zero-shot de novo sequencing, large-scale post-translational modification (PTM) support, and peptide identification sensitivity. This work introduces pUniFind, which the authors describe as the first large-scale multimodal pre-trained model in proteomics. It combines end-to-end peptide-spectrum scoring with open, zero-shot de novo sequencing by aligning the spectral and peptide modalities through cross-modality prediction. The model supports over 1,300 modifications and identifies 60% more peptide-spectrum matches (PSMs) than existing de novo methods despite a 300-fold larger search space. It also yields a 42.6% increase in identified peptides in immunopeptidomics, and its deep-learning-based quality control module recovers 38.5% additional peptides, including 1,891 that map to the genome but are absent from reference proteomes.

📝 Abstract
Deep learning has advanced mass spectrometry data interpretation, yet most models remain feature extractors rather than unified scoring frameworks. We present pUniFind, the first large-scale multimodal pre-trained model in proteomics that integrates end-to-end peptide-spectrum scoring with open, zero-shot de novo sequencing. Trained on over 100 million open-search-derived spectra, pUniFind aligns the spectral and peptide modalities via cross-modality prediction and outperforms traditional engines across diverse datasets, most notably achieving a 42.6 percent increase in the number of identified peptides in immunopeptidomics. Supporting over 1,300 modifications, pUniFind identifies 60 percent more PSMs than existing de novo methods despite a 300-fold larger search space. A deep-learning-based quality control module further recovers 38.5 percent additional peptides, including 1,891 that map to the genome but are absent from reference proteomes, while preserving full fragment ion coverage. These results establish a unified, scalable deep learning framework for proteomic analysis, offering improved sensitivity, modification coverage, and interpretability.
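
The abstract mentions "full fragment ion coverage" without defining how it is measured. As a rough illustration only, the sketch below uses one common convention: a backbone cleavage site counts as covered if a singly charged b or y ion for that site is observed within a mass tolerance. The function names, the 0.02 m/z tolerance, and the restriction to unmodified peptides and charge 1+ ions are assumptions made for this example, not details taken from pUniFind.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O, added to y ions
PROTON = 1.007276   # charge carrier for 1+ ions


def fragment_mz(peptide):
    """Return (site, b_mz, y_mz) for each backbone cleavage site (1+ ions)."""
    if len(peptide) < 2:
        raise ValueError("a peptide needs at least two residues to fragment")
    masses = [MONO[aa] for aa in peptide]
    sites = []
    for i in range(1, len(masses)):
        b_mz = sum(masses[:i]) + PROTON          # N-terminal fragment
        y_mz = sum(masses[i:]) + WATER + PROTON  # C-terminal fragment
        sites.append((i, b_mz, y_mz))
    return sites


def cleavage_coverage(peptide, observed_mz, tol=0.02):
    """Fraction of cleavage sites supported by a matching b or y ion."""
    def hit(mz):
        return any(abs(mz - obs) <= tol for obs in observed_mz)

    sites = fragment_mz(peptide)
    covered = sum(1 for _, b_mz, y_mz in sites if hit(b_mz) or hit(y_mz))
    return covered / len(sites)
```

Under this convention, "full" coverage corresponds to `cleavage_coverage(...) == 1.0`, meaning every peptide bond is supported by at least one matched fragment.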
Problem

Research questions and friction points this paper is trying to address.

Develop a unified scoring framework for mass spectra interpretation
Improve peptide identification accuracy in proteomics data
Support large-scale modification detection in spectral analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large multimodal pre-trained model for proteomics
Cross modality prediction for spectral alignment
Deep learning quality control for peptide recovery
Jiale Zhao
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing, 100190, Beijing, China.
Pengzhi Mao
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing, 100190, Beijing, China.
Kaifei Wang
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing, 100190, Beijing, China.
Yiming Li
University of Chinese Academy of Sciences, Beijing, China.
Yaping Peng
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing, 100190, Beijing, China.
Ranfei Chen
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing, 100190, Beijing, China.
Shuqi Lu
DP Technology Co., Ltd., Beijing, China.
Xiaohong Ji
DP Technology Co., Ltd., Beijing, China.
Jiaxiang Ding
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Beijing, 100190, Beijing, China.
Xin Zhang
University of Chinese Academy of Sciences, Beijing, China.
Yucheng Liao
Center for Machine Learning Research, Peking University, Beijing, China.
Weinan E
Professor of Mathematics, Princeton University
applied mathematics
Weijie Zhang
University of Kansas Medical Center
Inverse planning, particle therapy
Han Wen
DP Technology Co., Ltd., Beijing, China; AI for Science Institute, Beijing, China; State Key Laboratory of Medical Proteomics, Beijing 102206, China.
Hao Chi
Institute of Computing Technology, Chinese Academy of Sciences
computational proteomics