A Text-guided Protein Design Framework

📅 2023-02-09

🏛️ arXiv.org

📈 Citations: 50

✨ Influential: 4

🤖 AI Summary

Existing AI-based protein design methods rely primarily on sequence and structural information and leave largely unused the functional knowledge that humans have recorded in textual descriptions. This work introduces ProteinDT, a multimodal framework that uses human-written functional text for protein design in three stages: ProteinCLAP aligns text and protein representations, a facilitator generates a protein representation from the text, and a decoder produces the protein sequence from that representation. To train it, the authors construct SwissProtCLAP, a dataset of 441K text-protein pairs. ProteinDT reaches over 90% accuracy on text-guided protein generation, the best hit ratio on 12 zero-shot text-guided protein editing tasks, and superior performance on four of six protein property prediction benchmarks.

📝 Abstract

Current AI-assisted protein design mainly utilizes protein sequence and structure information. Meanwhile, there exists tremendous knowledge curated by humans in text format describing proteins' high-level functionalities. Yet, whether incorporating such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three consecutive steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator that generates the protein representation from the text modality; and a decoder that creates the protein sequence from the representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
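
The abstract does not spell out how ProteinCLAP aligns the two modalities. As a rough illustration only, the sketch below assumes a symmetric InfoNCE-style contrastive objective over a batch of paired text and protein embeddings, which is one common way to align two representation spaces. The function names (`cosine`, `info_nce`), the temperature value, and the toy two-dimensional embeddings are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_embs, protein_embs, temperature=0.1):
    """Symmetric InfoNCE loss (assumed objective, not confirmed by the source).

    text_embs[i] and protein_embs[i] are treated as a matched pair; every
    other combination in the batch serves as a negative.
    """
    n = len(text_embs)
    # Scaled similarity matrix: rows index texts, columns index proteins.
    logits = [[cosine(t, p) / temperature for p in protein_embs]
              for t in text_embs]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    text_to_protein = cross_entropy_on_diagonal(logits)
    # Transpose the matrix to score the protein-to-text direction.
    protein_to_text = cross_entropy_on_diagonal(
        [list(col) for col in zip(*logits)])
    return 0.5 * (text_to_protein + protein_to_text)


# Toy embeddings: matched pairs point in similar directions.
text = [[1.0, 0.0], [0.0, 1.0]]
protein_matched = [[0.9, 0.1], [0.1, 0.9]]
protein_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = info_nce(text, protein_matched)
loss_swapped = info_nce(text, protein_swapped)
print(loss_matched < loss_swapped)
```

Under this assumed objective the loss is small when each text embedding is closest to its own protein embedding and large when the pairs are swapped, which is the behavior an alignment stage needs before a facilitator can map text representations into the protein representation space.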

Problem

Research questions and friction points this paper is trying to address.

Protein Design

Textual Information

Artificial Intelligence

Innovation

Methods, ideas, or system contributions that make the work stand out.

ProteinDT

Text-guided Protein Design

SwissProtCLAP Dataset
