Protein Design with Dynamic Protein Vocabulary

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Generative models for function-based de novo protein design from text often produce sequences with poor structural plausibility and low foldability. Method: ProDVa uses a dynamic protein vocabulary: given a textual functional description, it retrieves fragments of natural proteins and incorporates them into the design process, so that designs are both functionally aligned and structurally plausible. The approach combines a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder for retrieving fragments. Foldability is evaluated with pLDDT and PAE. Results: Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing more well-folded proteins: the proportion of designs with pLDDT above 70 increases by 7.38% and the proportion with PAE below 10 increases by 9.6%. The work brings dynamic fragment retrieval into text-to-protein generation as a route toward more controllable and reliable generative protein design.
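The two foldability figures above are proportions of designs that pass a threshold. As a minimal sketch of how such proportions could be computed, assuming each design already has a mean pLDDT and a mean PAE from a structure predictor (the function name `foldability_summary` and the example scores are hypothetical and not taken from the paper):

```python
def foldability_summary(designs, plddt_min=70.0, pae_max=10.0):
    """Return the fraction of designs passing each foldability threshold.

    designs: list of dicts with per-design mean 'plddt' and 'pae' scores.
    Thresholds follow the summary: pLDDT above 70, PAE below 10.
    """
    n = len(designs)
    if n == 0:
        raise ValueError("at least one design is required")
    plddt_pass = sum(d["plddt"] > plddt_min for d in designs)
    pae_pass = sum(d["pae"] < pae_max for d in designs)
    return {"plddt_pass": plddt_pass / n, "pae_pass": pae_pass / n}


# Hypothetical scores for four designs (illustrative values only).
designs = [
    {"plddt": 82.1, "pae": 6.3},
    {"plddt": 65.4, "pae": 12.8},
    {"plddt": 74.0, "pae": 9.1},
    {"plddt": 58.9, "pae": 15.2},
]
summary = foldability_summary(designs)
print(summary)
```

Comparing two models then amounts to computing this summary for each model's designs and taking the difference between the pass rates.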

📝 Abstract
Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet these models struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.
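The idea of a dynamic vocabulary can be illustrated with a toy sketch. It is not the authors' implementation: it assumes retrieval ranks fragments by cosine similarity between a text embedding and fragment embeddings, and every embedding, fragment sequence, and function name below is made up for illustration.

```python
import math

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_fragments(text_vec, fragment_bank, k=2):
    """Return the k fragment sequences most similar to the text embedding."""
    ranked = sorted(fragment_bank,
                    key=lambda f: cosine(text_vec, f["vec"]),
                    reverse=True)
    return [f["seq"] for f in ranked[:k]]


def build_dynamic_vocab(base_vocab, fragments):
    """Extend the residue vocabulary with retrieved fragments as extra tokens."""
    vocab = list(base_vocab)
    for frag in fragments:
        if frag not in vocab:
            vocab.append(frag)
    return vocab


# Hypothetical fragment bank: sequences and embeddings are placeholders.
fragment_bank = [
    {"seq": "GKSTLA", "vec": [0.9, 0.1, 0.0]},
    {"seq": "HELLH", "vec": [0.0, 1.0, 0.0]},
    {"seq": "CPYCG", "vec": [0.7, 0.7, 0.1]},
]
text_vec = [1.0, 0.2, 0.0]  # stand-in for an encoded functional description

retrieved = retrieve_fragments(text_vec, fragment_bank, k=2)
vocab = build_dynamic_vocab(AMINO_ACIDS, retrieved)

# A generator choosing from this vocabulary can emit a whole fragment in one step.
design = "".join(["M", "GKSTLA", "A", "CPYCG"])
print(retrieved, len(vocab), design)
```

The point of the sketch is that the vocabulary changes per description: single residues are always available, while the multi-residue fragment tokens depend on what was retrieved for the given text.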
Problem

Research questions and friction points this paper is trying to address.

Enhancing protein foldability in generative models
Integrating natural protein fragments for structural plausibility
Designing functional and well-folded proteins efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic fragment retrieval enhances protein foldability
Integrates a text encoder, a protein language model, and a fragment encoder
Achieves function alignment comparable to state-of-the-art models with less than 0.04% of the training data
Nuowei Liu
School of Computer Science and Technology, East China Normal University
Jiahao Kuang
School of Computer Science and Technology, East China Normal University
Yanting Liu
School of Computer Science and Technology, East China Normal University
Changzhi Sun
Institute of Artificial Intelligence (TeleAI), China Telecom
Machine Learning, Natural Language Processing, AI for Science
Tao Ji
Renmin University of China
Yuanbin Wu
School of Computer Science and Technology, East China Normal University
Man Lan
School of Computer Science and Technology, East China Normal University
NLP