🤖 AI Summary
Current DIA mass spectrometry analysis methods rely on within-run semi-supervised rescoring, which is prone to overfitting and generalizes poorly. This work proposes DIA-CLIP, the first approach to introduce universal cross-modal representation learning into DIA proteomics. By combining a dual-encoder contrastive learning framework with an encoder–decoder architecture, DIA-CLIP constructs a unified embedding space for peptides and mass spectra, enabling zero-shot, high-precision peptide–spectrum matching without within-run training. The method outperforms existing tools across multiple benchmarks, achieving up to a 45% increase in protein identifications and a 12% reduction in entrapment identifications. DIA-CLIP is also presented as applicable to emerging areas such as single-cell and spatial proteomics.
📝 Abstract
Data-independent acquisition mass spectrometry (DIA-MS) has established itself as a cornerstone of proteomic profiling and large-scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA-CLIP, a pre-trained model that shifts the DIA analysis paradigm from semi-supervised training to universal cross-modal representation learning. By integrating a dual-encoder contrastive learning framework with an encoder-decoder architecture, DIA-CLIP establishes a unified cross-modal representation for peptides and their corresponding spectral features, achieving high-precision, zero-shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to a 45% increase in protein identifications while achieving a 12% reduction in entrapment identifications. Moreover, DIA-CLIP holds considerable potential for practical applications such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidation of intricate cellular mechanisms.
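
To make the dual-encoder contrastive idea concrete, here is a minimal NumPy sketch of a generic CLIP-style symmetric contrastive loss, followed by zero-shot matching through cosine similarity. It is not the DIA-CLIP implementation: the function names, the temperature value, and the assumption that row i of each embedding matrix forms a true peptide/spectrum pair are all illustrative, and the encoders that would produce the embeddings are left out entirely.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(peptide_emb, spectrum_emb, temperature=0.07):
    # Assumption: row i of peptide_emb and row i of spectrum_emb are a true pair;
    # every other row in the batch acts as a negative.
    p = l2_normalize(peptide_emb)
    s = l2_normalize(spectrum_emb)
    logits = p @ s.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_p2s = -log_softmax(logits, axis=1)[idx, idx].mean()  # peptide -> spectrum
    loss_s2p = -log_softmax(logits, axis=0)[idx, idx].mean()  # spectrum -> peptide
    return 0.5 * (loss_p2s + loss_s2p)

def zero_shot_match(peptide_emb, spectrum_emb):
    # For each spectrum, return the index and cosine score of the closest peptide.
    sims = l2_normalize(spectrum_emb) @ l2_normalize(peptide_emb).T
    return sims.argmax(axis=1), sims.max(axis=1)

# Toy usage with hand-made embeddings (hypothetical data, no real spectra).
peptides = np.eye(4)
aligned_spectra = np.eye(4)
shuffled_spectra = np.eye(4)[[1, 2, 3, 0]]

loss_aligned = clip_style_loss(peptides, aligned_spectra)
loss_shuffled = clip_style_loss(peptides, shuffled_spectra)
best_peptide, score = zero_shot_match(peptides, shuffled_spectra)
```

In this toy setting the loss is lower when paired rows point in the same direction, and matching by similarity recovers the right peptide for each spectrum even when the spectra are reordered. Once the embedding space is trained, scoring a candidate pair needs no per-run fitting, which is the property the abstract calls zero-shot PSM inference.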