🤖 AI Summary
Inverse protein folding remains difficult in highly uncertain regions, such as loops and intrinsically disordered segments, where sequence prediction accuracy is low and uncertainty is poorly calibrated. To address this, we propose MapDiff, a structure-guided discrete diffusion framework that conditions sequence generation on the protein backbone and iteratively denoises residue assignments. Its graph-based denoising network is pre-trained with a mask prior to capture both structural and residue interactions, and the generative process combines a denoising diffusion implicit model (DDIM) with Monte Carlo dropout to improve uncertainty estimation. Evaluated on four sequence design benchmarks, MapDiff significantly outperforms state-of-the-art methods. The generated sequences closely match the physicochemical properties and 3D structural features of native proteins across diverse folds and protein families, which suggests good generalization and biological plausibility.
📝 Abstract
Inverse protein folding generates valid amino acid sequences that can fold into a desired protein structure, with recent deep-learning advances showing significant potential and competitive performance. However, challenges remain in predicting highly uncertain regions, such as loops and disordered regions. To tackle such low-confidence residue prediction, we propose a Mask prior-guided denoising Diffusion (MapDiff) framework that accurately captures both structural and residue interactions for inverse protein folding. MapDiff is a discrete diffusion probabilistic model that iteratively generates amino acid sequences with reduced noise, conditioned on a given protein backbone. To incorporate structural and residue interactions, we develop a graph-based denoising network with a mask prior pre-training strategy. Moreover, in the generative process, we combine the denoising diffusion implicit model with Monte-Carlo dropout to improve uncertainty estimation. Evaluation on four challenging sequence design benchmarks shows that MapDiff significantly outperforms state-of-the-art methods. Furthermore, the in-silico sequences generated by MapDiff closely resemble the physico-chemical and structural characteristics of native proteins across different protein families and architectures.
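As a rough illustration of the Monte-Carlo dropout idea mentioned in the abstract, the sketch below averages several stochastic forward passes for a single residue position and reports the entropy of the averaged distribution as an uncertainty score. It is not MapDiff's network or sampling procedure: `toy_denoiser`, `mc_dropout_predict`, the random weights, the 8-dimensional feature vector, and the dropout rate are all hypothetical stand-ins chosen for the example.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_denoiser(features, weights, drop_rate, rng):
    """Hypothetical stand-in for a denoising network: one linear layer
    whose input dropout stays active at inference time."""
    keep = 1.0 - drop_rate
    # Inverted dropout: zero a feature with probability drop_rate, rescale the rest.
    dropped = [f / keep if rng.random() < keep else 0.0 for f in features]
    logits = [sum(w * f for w, f in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(features, weights, n_samples=20, drop_rate=0.1, seed=0):
    """Average residue probabilities over n_samples stochastic passes and
    return (mean_probs, entropy of the mean distribution)."""
    rng = random.Random(seed)
    mean_probs = [0.0] * len(weights)
    for _ in range(n_samples):
        probs = toy_denoiser(features, weights, drop_rate, rng)
        mean_probs = [m + p / n_samples for m, p in zip(mean_probs, probs)]
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return mean_probs, entropy


# Usage with made-up weights and features (assumed values for illustration only).
init = random.Random(1)
weights = [[init.gauss(0, 1) for _ in range(8)] for _ in AMINO_ACIDS]
features = [init.gauss(0, 1) for _ in range(8)]
mean_probs, entropy = mc_dropout_predict(features, weights)
best_residue = AMINO_ACIDS[max(range(len(AMINO_ACIDS)), key=mean_probs.__getitem__)]
```

A higher entropy marks a position where the stochastic passes disagree or spread probability across many residues, which is the kind of low-confidence prediction the abstract associates with loops and disordered regions.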