🤖 AI Summary
Existing protein language models (pLMs) lack explicit three-dimensional structural knowledge, which limits their performance on structure-related tasks. To address this, the authors propose a dual-task structure alignment framework: (1) a latent-level contrastive learning task that aligns pLM residue representations with those of a pre-trained protein graph neural network (pGNN) across multiple proteins, supplying inter-protein structural knowledge, and (2) a physical-level task that trains the pLM to predict structural tokens, supplying intra-protein structural knowledge. Because the quality of PDB structures varies, they also introduce a residue loss selection module, in which a small model trained on high-quality structures selects reliable yet challenging residue losses for the pLM to learn. Applied to ESM2 and AMPLIFY, the method yields gains across a wide range of tasks, including a 12.7% increase in ESM2 contact prediction. The data, code, and resulting SaESM2 and SaAMPLIFY models will be released on Hugging Face.
📝 Abstract
Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but lack the structural knowledge essential for many biological applications. To address this, we integrate structural insights from pre-trained protein graph neural networks (pGNNs) into pLMs through a latent-level contrastive learning task. This task aligns residue representations from pLMs with those from pGNNs across multiple proteins, enriching pLMs with inter-protein structural knowledge. Additionally, we incorporate a physical-level task that infuses intra-protein structural knowledge by optimizing pLMs to predict structural tokens. The proposed dual-task framework effectively incorporates both inter-protein and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in the Protein Data Bank (PDB), we further introduce a residue loss selection module, which uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn. Applying our structure alignment method to the state-of-the-art ESM2 and AMPLIFY models results in notable performance gains across a wide range of tasks, including a 12.7% increase in ESM2 contact prediction. The data, code, and resulting SaESM2 and SaAMPLIFY models will be released on Hugging Face.
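The abstract does not give formulas, so the following is only a minimal sketch of how the latent-level alignment and the residue loss selection could fit together. It assumes an InfoNCE-style contrastive loss over cosine similarities and an "excess loss" selection rule (pLM loss minus the small reference model's loss); both are illustrative assumptions, not the authors' confirmed formulation. All function names, the temperature, and `keep_ratio` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def residue_contrastive_losses(plm_emb, gnn_emb, temperature=0.1):
    """Per-residue InfoNCE loss (assumed form, not taken from the paper).

    Row i of plm_emb and row i of gnn_emb describe the same residue.
    Rows may be pooled from several proteins, so the negatives for a
    residue include residues from other proteins.
    """
    losses = []
    for i, z in enumerate(plm_emb):
        logits = [cosine(z, g) / temperature for g in gnn_emb]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return losses


def select_residue_losses(losses, ref_losses, keep_ratio=0.5):
    """Hypothetical selection rule: keep residues with the largest excess loss.

    ref_losses stand in for the losses of a small model trained on
    high-quality structures. A large excess (loss - ref_loss) marks a
    residue the reference model handles well but the pLM does not yet.
    Returns a 0/1 mask over residues.
    """
    excess = [l - r for l, r in zip(losses, ref_losses)]
    k = max(1, int(len(losses) * keep_ratio))
    order = sorted(range(len(losses)), key=lambda i: excess[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(losses))]


def masked_mean(losses, mask):
    """Average only the selected residue losses."""
    return sum(l * m for l, m in zip(losses, mask)) / sum(mask)


if __name__ == "__main__":
    # Toy 2-D embeddings for three residues (made-up numbers).
    plm = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    gnn = [[1.0, 0.1], [0.1, 1.0], [0.9, 1.0]]
    per_residue = residue_contrastive_losses(plm, gnn)
    ref = [0.01, 0.01, 0.5]  # made-up reference-model losses
    mask = select_residue_losses(per_residue, ref, keep_ratio=0.67)
    print(per_residue, mask, masked_mean(per_residue, mask))
```

In a real training loop the embeddings would come from the pLM and a frozen pGNN, and the masked contrastive loss would be combined with the structural token prediction loss. How the two are weighted is not stated in the abstract.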