🤖 AI Summary
This work addresses a key limitation of protein secondary structure prediction: the neglect of three-dimensional structural information. To this end, we propose SSRGNet, a hybrid architecture that first constructs a relation-enhanced residue graph from known protein 3D structures, explicitly encoding spatial adjacency and geometric constraints. It then combines a Transformer-based pretrained language model (to capture sequential semantics) with a multi-layer Relational Graph Convolutional Network (R-GCN) (to model structural dependencies), augmented by a local-region convolution mechanism to strengthen short-range interaction modeling. Crucially, SSRGNet is the first method to synergistically leverage large-scale unlabeled sequences and limited yet highly informative 3D structural data for secondary structure prediction. On the NetSurfP-2.0 benchmark, SSRGNet achieves statistically significant improvements in F1-score over state-of-the-art methods for both three-state (Q3) and eight-state (Q8) prediction tasks, empirically validating the efficacy of sequence–structure joint modeling.
📝 Abstract
In this study, we tackle the challenging task of predicting secondary structure from protein primary sequences, a pivotal first step towards predicting tertiary structure that also yields crucial insights into protein activity, relationships, and functions. Existing methods often rely on large sets of unlabeled amino acid sequences. However, these approaches neither explicitly capture nor exploit the available protein 3D structural data, which is recognized as a decisive factor in determining protein function. To address this, we use protein residue graphs and introduce several types of sequential or structural connections to capture richer spatial information. We combine Graph Neural Networks (GNNs) and Language Models (LMs): a pre-trained transformer-based protein language model encodes the amino acid sequence, and message-passing mechanisms such as GCN and R-GCN capture the geometric characteristics of the protein structure. Applying convolution over each node's local neighborhood, including its relations, we stack multiple convolutional layers to efficiently learn combined representations from the protein's spatial graph, revealing intricate interconnections and dependencies in its structural arrangement. To assess our model's performance, we use the training dataset provided by NetSurfP-2.0, which annotates secondary structure in 3 and 8 states. Extensive experiments show that our proposed model, SSRGNet, surpasses the baseline on F1 scores.
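To make the graph side of this description concrete, the following is a minimal NumPy sketch of a residue graph with two relation types and a single relational graph convolution in the standard R-GCN form. It is an illustration under stated assumptions, not the SSRGNet implementation: the two edge types, the sequence window of 2, the 10 Å radius, the feature sizes, and the function names `build_relation_adjacency` and `rgcn_layer` are all hypothetical, and random vectors stand in for the per-residue embeddings that a pretrained protein language model would supply.

```python
import numpy as np


def build_relation_adjacency(coords, seq_window=2, radius=10.0):
    """Return an (R, N, N) stack of adjacency matrices, one per relation.

    Relation 0: sequential edges, 0 < |i - j| <= seq_window.
    Relation 1: spatial edges, distance < radius and not already sequential.
    Both the window and the radius are illustrative assumptions.
    """
    n = coords.shape[0]
    idx = np.arange(n)
    seq_gap = np.abs(idx[:, None] - idx[None, :])
    sequential = (seq_gap > 0) & (seq_gap <= seq_window)
    # Pairwise Euclidean distances between residue coordinates.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    spatial = (dist < radius) & (seq_gap > seq_window)
    return np.stack([sequential, spatial]).astype(np.float64)


def rgcn_layer(h, adj, w_rel, w_self):
    """One relational graph convolution with mean aggregation per relation.

    h'_i = ReLU(W_self h_i + sum_r (1 / |N_r(i)|) sum_{j in N_r(i)} W_r h_j)
    """
    out = h @ w_self
    for r in range(adj.shape[0]):
        deg = adj[r].sum(axis=1, keepdims=True)
        # Residues with no neighbors under relation r contribute nothing.
        norm_adj = adj[r] / np.maximum(deg, 1.0)
        out += norm_adj @ h @ w_rel[r]
    return np.maximum(out, 0.0)


rng = np.random.default_rng(0)
n_res, d_in, d_out = 6, 8, 4

# Hypothetical residue coordinates (for example, C-alpha positions).
coords = rng.normal(scale=5.0, size=(n_res, 3))
# Placeholder for language-model embeddings of each residue.
h = rng.normal(size=(n_res, d_in))

adj = build_relation_adjacency(coords)
w_rel = rng.normal(size=(adj.shape[0], d_in, d_out)) * 0.1
w_self = rng.normal(size=(d_in, d_out)) * 0.1

h_next = rgcn_layer(h, adj, w_rel, w_self)
```

Stacking several such layers lets each residue aggregate information from progressively larger neighborhoods of the spatial graph; a per-residue classifier over the final features would then produce the 3-state or 8-state labels.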