🤖 AI Summary
Existing deep learning methods for mass spectrometry data interpretation serve mostly as feature extractors, lack a unified scoring framework, and remain limited in zero-shot de novo sequencing, large-scale post-translational modification (PTM) support, and peptide identification sensitivity. This work introduces pUniFind, described as the first large-scale multimodal pre-trained model in proteomics, which combines end-to-end peptide-spectrum scoring with open, zero-shot de novo sequencing by aligning the spectral and peptide modalities. The model supports over 1,300 PTMs and identifies 60% more peptide-spectrum matches (PSMs) than existing de novo methods despite a 300-fold larger search space. Experiments also show a 42.6% increase in identified peptides in immunopeptidomics, and a deep-learning-based quality control module recovers 38.5% additional peptides, including 1,891 that map to the genome but are absent from reference proteomes.
📝 Abstract
Deep learning has advanced mass spectrometry data interpretation, yet most models remain feature extractors rather than unified scoring frameworks. We present pUniFind, the first large-scale multimodal pre-trained model in proteomics that integrates end-to-end peptide-spectrum scoring with open, zero-shot de novo sequencing. Trained on over 100 million spectra derived from open searches, pUniFind aligns the spectral and peptide modalities via cross-modality prediction and outperforms traditional search engines across diverse datasets, most notably with a 42.6 percent increase in the number of identified peptides in immunopeptidomics. Supporting over 1,300 modifications, pUniFind identifies 60 percent more PSMs than existing de novo methods despite a 300-fold larger search space. A deep-learning-based quality control module further recovers 38.5 percent additional peptides, including 1,891 that map to the genome but are absent from reference proteomes, while preserving full fragment ion coverage. These results establish a unified, scalable deep learning framework for proteomic analysis, offering improved sensitivity, modification coverage, and interpretability.