🤖 AI Summary
Protein inverse folding, the design of functional sequences from target 3D structures, remains a central challenge in computational protein engineering. Existing approaches either neglect evolutionary information or rely on parameter-heavy, computationally expensive protein language models (PLMs) with limited scalability. We propose RadDiff, a retrieval-augmented denoising diffusion model that uses structural similarity to retrieve nearby templates from large-scale protein databases, then constructs position-specific amino acid profiles as conditional priors to guide sequence generation. RadDiff combines hierarchical structural retrieval, residue-level alignment, a lightweight integration module, and diffusion-based modeling, achieving superior parameter efficiency and generalization. On the CATH, PDB, and TS50 benchmarks, RadDiff improves sequence recovery rate by up to 19% over prior methods; generated sequences exhibit high foldability, and performance scales consistently with database size.
📝 Abstract
Protein inverse folding, the design of an amino acid sequence based on a target 3D structure, is a fundamental problem in computational protein engineering. Existing methods either generate sequences without leveraging external knowledge or rely on protein language models (PLMs). The former omit the evolutionary information stored in protein databases, while the latter are parameter-inefficient and cannot easily adapt to ever-growing protein data. To overcome these drawbacks, we propose retrieval-augmented denoising diffusion (RadDiff), a novel method for protein inverse folding. Given the target protein backbone, RadDiff uses a hierarchical search strategy to efficiently retrieve structurally similar proteins from large protein databases. The retrieved structures are then aligned residue-by-residue to the target to construct a position-specific amino acid profile, which serves as an evolutionarily informed prior that conditions the denoising process. A lightweight integration module is further designed to incorporate this prior effectively. Experimental results on the CATH, PDB, and TS50 datasets show that RadDiff consistently outperforms existing methods, improving sequence recovery rate by up to 19%. The results also show that RadDiff generates highly foldable sequences and scales effectively with database size.
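To make the profile-construction step concrete, the following is a minimal sketch of how a position-specific amino acid profile could be built from retrieved structures that are already aligned to the target. It is not RadDiff's implementation: the abstract does not say how alignments are represented or how counts are normalized, so the function name `build_profile`, the `(target_pos, hit_pos)` pair format, and the pseudocount smoothing are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed column order for the profile.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(target_len, aligned_hits, pseudocount=1.0):
    """Return a target_len x 20 matrix of per-position amino acid frequencies.

    aligned_hits is a list of (hit_sequence, pairs) tuples, where pairs lists
    the (target_pos, hit_pos) residue correspondences from a structural
    alignment. This input format is hypothetical.
    """
    counts = [Counter() for _ in range(target_len)]
    for hit_seq, pairs in aligned_hits:
        for target_pos, hit_pos in pairs:
            residue = hit_seq[hit_pos]
            if residue in AMINO_ACIDS:  # skip non-standard residues
                counts[target_pos][residue] += 1

    profile = []
    for position_counts in counts:
        # Additive smoothing (an assumption) keeps unaligned positions uniform
        # and prevents zero probabilities.
        total = sum(position_counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            [(position_counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]
        )
    return profile


# Toy example: two retrieved hits aligned to a 3-residue target.
hits = [
    ("ACD", [(0, 0), (1, 1), (2, 2)]),
    ("AGD", [(0, 0), (2, 2)]),  # target position 1 has no aligned partner
]
profile = build_profile(3, hits)
```

Each row of `profile` is a probability distribution over the 20 amino acids for one target position. In a retrieval-augmented setup of the kind the abstract describes, a matrix like this would be passed to the denoising model as a conditioning signal; how RadDiff's integration module consumes it is not specified here.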