🤖 AI Summary
Function-based de novo protein design from textual descriptions often produces sequences with poor structural plausibility and low foldability. Method: We propose ProDVa, a dynamic protein vocabulary mechanism that, given a functional text description, retrieves fragments of natural proteins and incorporates them during generation so that designs are both structurally plausible and functionally aligned. The approach combines a text encoder for functional descriptions, a protein language model for sequence design, and a fragment encoder for retrieval; foldability is assessed with pLDDT and PAE. Results: Using less than 0.04% of the training data of state-of-the-art models, ProDVa achieves comparable function alignment. It increases the proportion of designs with pLDDT above 70 by 7.38% and those with PAE below 10 by 9.6%. This work brings dynamic fragment retrieval into text-to-protein generation as a route toward more foldable, functionally aligned protein designs.
📝 Abstract
Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet these models still struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.
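The two foldability figures quoted above are proportions of designs that clear a threshold. The following minimal sketch shows how such proportions could be computed. It assumes that each designed protein has already been scored by a structure predictor and reduced to one pLDDT value and one PAE value; the function name `foldability_rates` and the input layout are hypothetical, not taken from the paper.

```python
def foldability_rates(designs, plddt_min=70.0, pae_max=10.0):
    """Return (share with pLDDT > plddt_min, share with PAE < pae_max).

    `designs` is an iterable of (plddt, pae) pairs, one per designed protein.
    Assumption: each pair is a per-protein summary score produced elsewhere.
    The default thresholds (70 and 10) are the ones quoted in the abstract.
    """
    designs = list(designs)
    if not designs:
        raise ValueError("no designs to score")
    n = len(designs)
    # Strict inequalities, matching "above 70" and "below 10".
    plddt_rate = sum(1 for plddt, _ in designs if plddt > plddt_min) / n
    pae_rate = sum(1 for _, pae in designs if pae < pae_max) / n
    return plddt_rate, pae_rate


# Hypothetical scores for four designs, for illustration only.
example = [(85.0, 5.0), (60.0, 12.0), (72.0, 9.0), (70.0, 10.0)]
plddt_rate, pae_rate = foldability_rates(example)
print(f"pLDDT > 70: {plddt_rate:.2%}, PAE < 10: {pae_rate:.2%}")
```

Comparing two models then amounts to calling the function on each model's designs and taking the difference of the resulting rates.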