🤖 AI Summary
This work asks whether latent-space prediction can complement masked language modeling (MLM), the standard training objective for protein language models, under a matched wall-clock training budget. It brings the Joint-Embedding Predictive Architecture (JEPA) into protein sequence modeling by jointly optimizing the MLM cross-entropy and a latent prediction loss applied only at masked positions. On a 16-task downstream suite, this recipe wins more tasks than it loses against MLM-only continued training: ESM2-35M achieves 10 wins, 3 losses, and 3 ties, while ESM2-150M attains 11 wins, 2 losses, and 3 ties. Results when pretraining from scratch are mixed (6 wins, 8 losses, 2 ties). Gains appear on tasks including protein stability prediction, enzyme classification, and remote homology detection.
📝 Abstract
Protein language models are trained primarily with masked language modeling (MLM), which predicts amino-acid identities at masked positions. We ask whether latent-space prediction can complement this token-level objective under a matched wall-clock budget. Across pretrained and randomly initialized protein sequence encoders at 35–150M parameters, we find that the best protein-JEPA design is not all-position latent prediction but a masked-position variant: it predicts latent targets only at masked positions and retains the MLM cross-entropy. We call this recipe masked-position MLM+JEPA. On a 16-task downstream suite (15 frozen linear probes plus SCOPe-40 zero-shot fold retrieval), under matched wall-clock budgets, this recipe wins more tasks than it loses against MLM-only continuation: 10 wins / 3 losses / 3 ties (hereafter W/L/T) on pretrained ESM2-35M and 11/2/3 on ESM2-150M, while results when pretraining from scratch are mixed (6/8/2). Gains are seen for multiple models on 11 of 16 tasks, including stability, β-lactamase fitness, variant effect, intrinsic disorder, remote homology, enzyme classification, and SCOPe-40 fold retrieval. The tasks with more losses than wins are Fluorescence (TAPE) and Peptide-HLA Binding. All-position MLM+JEPA matches MLM-only overall but does not reproduce the masked-position gains. JEPA-only training (no MLM) collapses in nearly every experiment. We conclude that JEPA, when combined with MLM, is competitive with and can outperform pure MLM in both pretraining and continued training, even under matched wall-clock budgets.
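The masked-position MLM+JEPA recipe described above can be illustrated with a minimal, dependency-free sketch of a combined loss. The abstract does not give the exact form of the latent loss, the source of the latent targets, or the loss weighting, so everything specific here is an assumption made for illustration: the function name `masked_mlm_jepa_loss`, the mean-squared-error latent term, the `jepa_weight` parameter, and the idea that `target_latents` come from a separate target encoder treated as constant.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def masked_mlm_jepa_loss(logits, token_ids, pred_latents, target_latents,
                         mask, jepa_weight=1.0):
    """Hypothetical combined loss, averaged over masked positions only.

    logits:         per-position vocabulary scores from the MLM head
    token_ids:      true amino-acid indices per position
    pred_latents:   predicted latent vectors per position
    target_latents: latent targets per position (assumed fixed, no gradient)
    mask:           True where the input token was masked
    jepa_weight:    assumed scalar weight on the latent term
    Returns (total, mlm, jepa).
    """
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    if not masked:
        return 0.0, 0.0, 0.0

    # MLM term: cross-entropy of the true token at each masked position.
    mlm = sum(-log_softmax(logits[i])[token_ids[i]] for i in masked) / len(masked)

    # Latent term (assumed MSE): computed only at masked positions,
    # so unmasked positions contribute nothing to either term.
    jepa = sum(
        sum((p - t) ** 2 for p, t in zip(pred_latents[i], target_latents[i]))
        / len(pred_latents[i])
        for i in masked
    ) / len(masked)

    return mlm + jepa_weight * jepa, mlm, jepa


# Toy example: three positions, two-token vocabulary, 2-d latents.
logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
token_ids = [0, 0, 1]
pred = [[1.0, 1.0], [0.0, 0.0], [5.0, 5.0]]
target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
mask = [True, False, False]
total, mlm, jepa = masked_mlm_jepa_loss(logits, token_ids, pred, target, mask)
```

In the toy example only the first position is masked, so the large latent error at the third position is ignored. Under this sketch, an all-position variant would correspond to computing the latent term over every index rather than only the masked ones, and a JEPA-only variant would drop the `mlm` term.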