Conditional Enzyme Generation Using Protein Language Models with Adapters

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing protein generation methods struggle to generalize to unseen enzyme functions (EC numbers) and taxonomic classes. To address this, we propose ProCALM—a novel framework enabling the first joint conditional modeling of enzyme function and taxonomy. Built upon ProGen2, ProCALM incorporates lightweight adapter modules that unify heterogeneous conditioning signals—including enzyme family identifiers, taxonomic labels, and natural language descriptions—into structured embeddings, and explicitly models their joint distribution. Experiments demonstrate that ProCALM matches state-of-the-art methods in generating high-fidelity sequences within target enzyme families; it is the first method to support *joint* controllable generation across both enzyme families and species. Moreover, ProCALM significantly improves generalization to rare and unseen enzyme classes, overcoming key limitations of prompt-based approaches in conditional expressivity and out-of-distribution extrapolation.

📝 Abstract
The conditional generation of proteins with desired functions and/or properties is a key goal for generative models. Existing methods based on prompting of language models can generate proteins conditioned on a target functionality, such as a desired enzyme family. However, these methods are limited to simple, tokenized conditioning and have not been shown to generalize to unseen functions. In this study, we propose ProCALM (Protein Conditionally Adapted Language Model), an approach for the conditional generation of proteins using adapters to protein language models. Our specific implementation of ProCALM involves finetuning ProGen2 to incorporate conditioning representations of enzyme function and taxonomy. ProCALM matches existing methods at conditionally generating sequences from target enzyme families. Impressively, it can also generate within the joint distribution of enzymatic function and taxonomy, and it can generalize to rare and unseen enzyme families and taxonomies. Overall, ProCALM is a flexible and computationally efficient approach, and we expect that it can be extended to a wide range of generative language models.
Problem

Research questions and friction points this paper is trying to address.

Conditional protein generation for desired functions
Overcoming limitations of simple tokenized conditioning
Generalizing to rare and unseen protein functions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses adapters for protein language models
Finetunes ProGen2 for versatile function conditioning
Generalizes to rare and unseen protein functions
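The adapter mechanism named in the bullets above can be sketched as a residual bottleneck layer that mixes a conditioning embedding into a frozen language model's hidden state. The sketch below is a plain-Python toy under assumed dimensions and random weights, not ProCALM's actual architecture; `adapter`, `HID`, `COND`, and the one-hot condition vector are all illustrative.

```python
# Toy sketch of a conditioned bottleneck adapter (assumptions: dimensions,
# weights, and the one-hot condition are illustrative, not from the paper).
import math
import random

random.seed(0)

def linear(x, weight, bias):
    """Dense layer y = Wx + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def tanh_vec(x):
    return [math.tanh(v) for v in x]

def make_weights(out_dim, in_dim):
    w = [[random.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return w, [0.0] * out_dim

def adapter(hidden, cond, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: down-project the concatenated [hidden; cond],
    apply a nonlinearity, up-project, and add residually to the hidden state."""
    z = tanh_vec(linear(hidden + cond, w_down, b_down))  # list concat = [h; c]
    delta = linear(z, w_up, b_up)
    return [h + d for h, d in zip(hidden, delta)]

HID, COND, BOTTLENECK = 8, 4, 3
w_down, b_down = make_weights(BOTTLENECK, HID + COND)
w_up, b_up = make_weights(HID, BOTTLENECK)

hidden = [random.gauss(0, 1) for _ in range(HID)]  # frozen LM hidden state (toy)
cond = [1.0, 0.0, 0.0, 0.0]  # toy one-hot conditioning vector (e.g., an enzyme class)

out = adapter(hidden, cond, w_down, b_down, w_up, b_up)
print(len(out))  # the residual update keeps the hidden dimensionality
```

Because the update is residual, the base model's representation is preserved and only a small number of adapter parameters need training, which is why this style of conditioning is lightweight compared to full finetuning.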
Jason Yang
Massachusetts Institute of Technology

Aadyot Bhatnagar
Profluent Bio

Jeffrey A. Ruffolo
Profluent Bio

Ali Madani
Profluent Bio