A Text-guided Protein Design Framework

📅 2023-02-09
🏛️ arXiv.org
📈 Citations: 50
Influential citations: 4
🤖 AI Summary
Existing AI-based protein design methods rely primarily on sequence and structural information and overlook the functional knowledge recorded in textual descriptions. This work introduces ProteinDT, a multimodal framework that brings human-written descriptions of protein function into protein design through three stages: cross-modal alignment, text-to-protein representation generation, and sequence decoding. The authors construct SwissProtCLAP, a large dataset of 441K text-protein pairs, and propose ProteinCLAP, a model that aligns the representations of textual descriptions and proteins. ProteinDT supports text-guided generation and zero-shot text-guided editing: it reports over 90% accuracy on text-guided protein generation, the best hit ratio on 12 zero-shot editing tasks, and superior performance on four of six protein property prediction benchmarks.
📝 Abstract
Current AI-assisted protein design mainly utilizes protein sequence and structure information. Meanwhile, there exists tremendous knowledge curated by humans in text format describing proteins' high-level functionalities. Yet, whether incorporating such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three consecutive steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator, which generates the protein representation from the text modality; and a decoder, which creates protein sequences from that representation. To train ProteinDT, we construct a large dataset, SwissProtCLAP, with 441K text and protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy for text-guided protein generation; (2) best hit ratio on 12 zero-shot text-guided protein editing tasks; (3) superior performance on four out of six protein property prediction benchmarks.
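The abstract does not state the objective used for the alignment step. As a minimal sketch, assuming ProteinCLAP is trained with a symmetric InfoNCE-style contrastive loss (suggested by the CLAP name, but an assumption here), the first stage could be illustrated as follows. The function names, the temperature value, and the toy embeddings are hypothetical; in practice the embeddings would come from a text encoder and a protein encoder.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_embs, protein_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Assumption: pair i in text_embs matches pair i in protein_embs, and
    every other protein in the batch serves as a negative for text i
    (and vice versa). This is an illustration, not the paper's exact loss.
    """
    text = [l2_normalize(v) for v in text_embs]
    prot = [l2_normalize(v) for v in protein_embs]
    # Cosine similarities scaled by the temperature: rows = texts, cols = proteins.
    logits = [[dot(t, p) / temperature for p in prot] for t in text]
    # Transpose to score proteins against texts.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch: two text embeddings and two protein embeddings.
text_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_proteins = [[2.0, 0.0], [0.0, 3.0]]      # each protein points the same way as its text
mismatched_proteins = [[0.0, 3.0], [2.0, 0.0]]   # pairs swapped

aligned_loss = symmetric_contrastive_loss(text_batch, aligned_proteins)
mismatched_loss = symmetric_contrastive_loss(text_batch, mismatched_proteins)
print(aligned_loss < mismatched_loss)
```

In this toy batch the correctly paired embeddings yield a much lower loss than the swapped ones, which is the behavior an alignment stage relies on: after training, a text description and its protein should sit close together in the shared space so that a later stage can map from one to the other.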
Problem

Research questions and friction points this paper is trying to address.

Protein Design
Textual Information
Artificial Intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

ProteinDT
Text-guided Protein Design
SwissProtCLAP Dataset

Authors

Shengchao Liu
University of California Berkeley, Berkeley, CA 94720, United States; California Institute of Technology, Pasadena, CA 91125, United States
Yutao Zhu
Université de Montréal, Montréal, QC H3T 1J4, Canada
Jiarui Lu
Université de Montréal, Montréal, QC H3T 1J4, Canada; Mila-Québec Artificial Intelligence Institute, Montréal, QC H2S 3H1, Canada
Zhao Xu
Texas A&M University, Texas, TX 77843, United States
Weili Nie
NVIDIA Research
Machine Learning, Deep Learning, Generative Models
A. Gitter
University of Wisconsin-Madison, Madison, WI 53706, United States; Morgridge Institute for Research, Madison, WI 53715, United States
Chaowei Xiao
University of Wisconsin - Madison/NVIDIA
Trustworthy Machine Learning, Adversarial Machine Learning, AI Safety, Robust AI, Security
Jian Tang
Mila-Québec Artificial Intelligence Institute, Montréal, QC H2S 3H1, Canada; HEC Montréal, Montréal, QC H3T 2A7, Canada
Hongyu Guo
Senior Research Scientist@NRC Canada, Adjunct Professor@University of Ottawa
machine learning, deep learning, geometric generative model, graph network
Anima Anandkumar
California Institute of Technology and NVIDIA
Machine Learning and Artificial Intelligence