🤖 AI Summary
Current protein language models often rely on mean-pooled sequence representations that are not explicitly optimized to capture functional, evolutionary, or structural similarity. This work proposes ProtSent, the first adaptation of the Sentence Transformer contrastive learning framework to protein representation learning. Building on the ESM-2 architecture, ProtSent trains with MultipleNegativesRankingLoss on protein pairs drawn from multiple sources (including Pfam, AlphaFold DB, STRING, and deep mutational scanning datasets), using contrastive fine-tuning without any task-specific supervision. Evaluated across 23 downstream tasks, the fine-tuned ESM-2 150M model improved on 15 tasks, with a 105% gain in remote homology detection and a 19.9% gain in Recall@1 for SCOPe-40 structural retrieval. The smaller 35M variant improved on 16 of the 23 tasks, indicating that the gains in embedding-space quality hold across model sizes.
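To make the Recall@1 retrieval metric concrete, here is a minimal sketch in plain Python. It assumes each protein has one fixed embedding and one label (for example a fold or family identifier), and it counts a query as a hit when its nearest other protein by cosine similarity shares that label. The function names and toy data are illustrative only, and the actual SCOPe-40 protocol may differ in details such as how queries without a same-label partner are handled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(embeddings, labels):
    """Fraction of queries whose nearest other item shares the query's label.

    Illustrative helper (not from the paper): every item is used as a query
    against all remaining items.
    """
    hits = 0
    for i, query in enumerate(embeddings):
        # Exclude the query itself, then pick the most similar remaining item.
        candidates = [j for j in range(len(embeddings)) if j != i]
        best = max(candidates, key=lambda j: cosine(query, embeddings[j]))
        hits += labels[best] == labels[i]
    return hits / len(embeddings)


# Toy embeddings: items 0 and 1 point the same way, as do items 2 and 3.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
well_clustered = recall_at_1(toy_embeddings, ["a", "a", "b", "b"])
badly_clustered = recall_at_1(toy_embeddings, ["a", "b", "a", "b"])
```

Because the score depends only on which neighbor is closest, it measures how well the embedding space groups related proteins rather than how well a trained classifier can separate them.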
📝 Abstract
Protein language models (pLMs) produce per-residue representations that capture evolutionary and structural information, yet their mean-pooled sequence embeddings are not explicitly trained to reflect functional, evolutionary, or structural similarity between proteins. We present Protein Sentence Transformers (ProtSent), a contrastive fine-tuning framework for adapting pLMs into general-purpose embedding models. ProtSent trains with MultipleNegativesRankingLoss across five protein-pair datasets: Pfam families, structurally derived hard negatives, AlphaFold DB structural pairs, StringDB protein-protein interactions, and Deep Mutational Scanning data. We evaluate on 23 downstream tasks using frozen embeddings with a k-nearest-neighbor probe to measure embedding neighborhood quality. On ESM-2 150M, ProtSent improves 15 of 23 tasks, with gains of +105% on remote homology detection, +17% on variant effect prediction, and +19.9% Recall@1 on SCOPe-40 structural retrieval. The 35M variant improves 16 of 23 tasks, with +40.5% on remote homology and +15.5% Recall@1 on SCOPe-40. Contrastive fine-tuning restructures the embedding space to better capture protein function and structure, without any task-specific supervision. We release the models, public data, training recipe, and code.
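For readers unfamiliar with MultipleNegativesRankingLoss, the following is a minimal plain-Python sketch of the idea, not the authors' training code. Given a batch of (anchor, positive) embedding pairs, each anchor is scored against every positive in the batch by scaled cosine similarity, and a cross-entropy loss rewards ranking its own positive first; the other positives serve as in-batch negatives. The function names are illustrative, the optional `hard_negatives` argument is an assumption about how extra candidates could be added, and the scale of 20 mirrors the default in the sentence-transformers library rather than a value stated in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, hard_negatives=None, scale=20.0):
    """Mean cross-entropy over in-batch candidates.

    positives[i] is the correct match for anchors[i]; every other positive,
    plus any optional hard negatives, is treated as a wrong candidate.
    """
    candidates = list(positives) + list(hard_negatives or [])
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability assigned to the true positive (index i).
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two anchors whose true partners are identical to them.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = multiple_negatives_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = multiple_negatives_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairing yields a loss near zero and the swapped pairing a loss near the scale value, which illustrates why minimizing the loss pulls paired proteins together and pushes unpaired ones apart in embedding space.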