GLProtein: Global-and-Local Structure Aware Protein Representation Learning

📅 2025-05-17
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing protein representation methods struggle to jointly model global structural similarity and local amino acid-level details, limiting functional prediction accuracy. To address this, we propose the first end-to-end, cross-scale protein structure-aware learning framework that unifies atomic-level 3D distance modeling, substructure-level molecular encoding, and inter-protein structural similarity ranking. Our method introduces a joint optimization paradigm combining masked language modeling with triplet-based structural scoring, integrating contrastive learning and self-supervised pretraining. Extensive experiments demonstrate significant improvements over state-of-the-art methods on contact map prediction, protein–protein interaction prediction, and other downstream tasks. Ablation studies confirm that the synergistic modeling of global and local structural information substantially enhances structural awareness and generalization capability across diverse protein-related tasks.

๐Ÿ“ Abstract
Proteins are central to biological systems, participating as building blocks across all forms of life. Despite advancements in understanding protein functions through protein sequence analysis, there remains potential for further exploration in integrating protein structural information. We argue that the structural information of proteins is not limited to their 3D information but also encompasses information ranging from amino acid molecules (local information) to protein-protein structure similarity (global information). To address this, we propose GLProtein, the first framework in protein pre-training that incorporates both global structural similarity and local amino acid details to enhance prediction accuracy and functional insights. GLProtein combines protein-masked modelling with triplet structure similarity scoring, protein 3D distance encoding, and substructure-based amino acid molecule encoding. Experimental results demonstrate that GLProtein outperforms previous methods in several bioinformatics tasks, including protein-protein interaction prediction and contact prediction.
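
To make the idea of combining masked modelling with triplet structure similarity scoring concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not give the loss formulas, so the cross-entropy form of the masked term, the hinge form of the triplet term, the Euclidean distance, and the names `masked_token_loss`, `triplet_margin_loss`, `joint_loss`, `weight`, and `margin` are all assumptions chosen for illustration.

```python
import math


def masked_token_loss(probs, targets, masked_positions):
    """Mean negative log-likelihood of the true residue at each masked position.

    probs[i][k] is the predicted probability of residue class k at position i.
    (Assumed form; the source does not specify the masked-modelling loss.)
    """
    losses = [-math.log(probs[i][targets[i]]) for i in masked_positions]
    return sum(losses) / len(losses)


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the anchor embedding is closer to the
    structurally similar protein than to the dissimilar one by `margin`.
    (Assumed form; `margin` is a hypothetical hyperparameter.)
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def joint_loss(probs, targets, masked_positions,
               anchor, positive, negative, weight=1.0, margin=1.0):
    """Weighted sum of the two terms; `weight` is a hypothetical hyperparameter."""
    mlm = masked_token_loss(probs, targets, masked_positions)
    trip = triplet_margin_loss(anchor, positive, negative, margin)
    return mlm + weight * trip
```

In this sketch the triplet term only contributes when the embedding of the structurally similar protein is not yet sufficiently closer to the anchor than the dissimilar one, so the masked-modelling term dominates once the global ranking is satisfied.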
Problem

Research questions and friction points this paper is trying to address.

Integrating global and local protein structural information for representation learning
Enhancing protein function prediction accuracy through structural awareness
Addressing limitations in current protein sequence analysis methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines masked modeling with triplet similarity scoring
Integrates 3D distance encoding for structural representation
Uses substructure-based amino acid molecule encoding
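
As a rough illustration of what a 3D distance encoding can look like, the sketch below computes pairwise distances between residue coordinates and discretizes them into bins. This is an assumption-laden example, not the method described in the paper: the use of one coordinate per residue, the bin edges, and the function names `pairwise_distances` and `bin_distances` are all hypothetical.

```python
import bisect
import math


def pairwise_distances(coords):
    """Return the symmetric matrix of Euclidean distances between residues.

    coords is a list of (x, y, z) tuples, assumed here to be one point per residue.
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def bin_distances(dist_matrix, bin_edges):
    """Map each distance to a bin index.

    A distance d falls in bin k when exactly k of the sorted edges are <= d,
    so values at or beyond the last edge share the final bin.
    """
    return [[bisect.bisect_right(bin_edges, d) for d in row] for row in dist_matrix]


# Hypothetical toy input: three residues and illustrative bin edges.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
bins = bin_distances(pairwise_distances(coords), [4.0, 8.0, 12.0])
```

A model could then look up a learned embedding per bin index and use it as a pairwise bias or feature; whether GLProtein does this is not stated in the summary above.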
Yunqing Liu
PhD Candidate, The Hong Kong Polytechnic University (PolyU)
AI4S
Wenqi Fan
Department of Computing (COMP), The Hong Kong Polytechnic University
Xiaoyong Wei
Department of Computer Science, Sichuan University
Qing Li
Department of Computing (COMP), The Hong Kong Polytechnic University