AI Summary
This work addresses vocabulary collapse in existing equivariant graph neural network-based methods for antibody CDR design, which over-predict a small set of amino acids and neglect functionally critical residues. To mitigate this, the authors propose an architecture that couples a frozen protein language model with an E(3)-equivariant graph neural network through a cross-attention adapter, combining evolutionary and 3D structural priors. The approach further uses progressive unfreezing of the language model and R-Drop consistency regularization to counter the collapse. Evaluated on CHIMERA-Bench, the method improves sequence recovery by 16% and reduces perplexity by 43% relative to the best GNN baselines, recovers 2.3x greater amino acid diversity, and attains the highest binding-pair correlation with ground truth among the compared methods.
Abstract
Equivariant graph neural network (GNN) methods for antibody complementarity-determining region (CDR) design achieve the highest sequence recovery but suffer from severe vocabulary collapse. The current best GNN methods over-predict very few amino acids, such as tyrosine and glycine, while ignoring functionally important residues. We trace this failure to GNN encoders learning amino acid distributions de novo from limited structural data, discarding substitution patterns encoded in evolutionary databases. To resolve this, we propose EvoStruct, which bridges a frozen protein language model (PLM) with 3D structural context from an E(3)-equivariant GNN via a cross-attention adapter. Unlike prior PLM-structure adapters for general protein design, EvoStruct targets the vocabulary collapse problem specific to CDR design through progressive PLM unfreezing and R-Drop consistency regularization. On the CHIMERA-Bench dataset, EvoStruct achieves the highest amino acid recovery and lowest perplexity among several antibody design methods, improving sequence recovery by 16% and reducing perplexity by 43% relative to the best GNN baselines, while recovering 2.3x greater amino acid diversity and attaining the highest binding-pair correlation with ground truth.
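The R-Drop consistency term mentioned above can be illustrated for a single residue position. The sketch below follows the general R-Drop formulation (two forward passes of the same input under different dropout masks, a negative log-likelihood for each pass, plus a symmetric KL penalty between the two predicted distributions). It is not taken from EvoStruct: the function names, the weight `alpha`, and the three-class toy logits are assumptions made for illustration, and the abstract does not state how the term is weighted or where it is applied.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def r_drop_loss(logits_a, logits_b, target, alpha=1.0):
    """Illustrative R-Drop loss for one position.

    logits_a, logits_b: scores from two stochastic forward passes of the
        same input (different dropout masks).
    target: index of the true class (here, the true amino acid).
    alpha: hypothetical weight on the consistency term.
    """
    p = softmax(logits_a)
    q = softmax(logits_b)
    # Negative log-likelihood of the target under each pass.
    nll = -math.log(p[target]) - math.log(q[target])
    # Symmetric KL pulls the two dropout-perturbed predictions together.
    sym_kl = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return nll + alpha * sym_kl


# Toy usage: two passes that disagree incur a consistency penalty.
pass_a = [2.0, 0.5, -1.0]
pass_b = [1.0, 1.5, -0.5]
loss = r_drop_loss(pass_a, pass_b, target=0, alpha=1.0)
```

When the two passes produce identical logits the KL term vanishes and the loss reduces to twice the ordinary negative log-likelihood, so the penalty only acts on the disagreement that dropout induces.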