ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing

πŸ“… 2024-02-26
πŸ“ˆ Citations: 28
✨ Influential: 1
πŸ“„ PDF
πŸ€– AI Summary
Existing protein language models (PLMs) exhibit a fundamental trade-off between understanding (PLU) and generation (PLG) capabilities, hindering advances in protein engineering. To address this, we propose the first unified multi-task protein language model that jointly handles PLU and PLG. Our approach comprises three key innovations: (1) a lightweight, general-purpose large language model (LLM) adaptation framework tailored for PLMs; (2) Protein Vocabulary Pruning (PVP), a technique that improves training efficiency by pruning redundant tokens from the vocabulary; and (3) a large-scale multi-task instruction dataset of 13 million samples, the first of its kind to enable end-to-end PLU/PLG co-training. Experiments demonstrate state-of-the-art performance in unconditional protein sequence generation, support for function-controllable generation, and a 62% exact-match rate in protein superfamily prediction. All code, model weights, and data are publicly released.

πŸ“ Abstract
Large Language Models (LLMs) have achieved remarkable performance in multiple Natural Language Processing (NLP) tasks. Under the premise that protein sequences constitute the protein language, Protein Language Models (PLMs) have advanced the field of protein engineering. However, unlike LLMs in NLP, current PLMs cannot handle the protein understanding task and the protein generation task simultaneously in the Protein Language Processing (PLP) field. This prompts us to delineate the inherent limitations of current PLMs: (i) the lack of natural language capabilities, (ii) insufficient instruction understanding, and (iii) high training resource demands. To address these challenges, we introduce a training framework to transform any general LLM into a PLM capable of handling multiple PLP tasks. To improve training efficiency, we propose Protein Vocabulary Pruning (PVP) for general LLMs. We construct a multi-task instruction dataset containing 13 million samples with superfamily information, facilitating better modeling of protein sequence-function landscapes. Through these methods, we develop the ProLLaMA model, the first known PLM to handle multiple PLP tasks simultaneously. Experiments show that ProLLaMA achieves state-of-the-art results in the unconditional protein sequence generation task. In the controllable protein sequence generation task, ProLLaMA can design novel proteins with desired functionalities. As for the protein understanding task, ProLLaMA achieves a 62% exact match rate in superfamily prediction. Code, model weights, and datasets are available at https://github.com/PKU-YuanGroup/ProLLaMA and https://huggingface.co/GreatCaptainNemo.
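The abstract describes Protein Vocabulary Pruning (PVP) only at a high level: removing vocabulary entries a general LLM does not need for protein sequences. A minimal sketch of that idea, assuming a simple token-to-id vocabulary; the toy tokens and the exact keep/drop rule (special tokens plus tokens spelled from canonical amino-acid letters) are illustrative, not ProLLaMA's actual procedure:

```python
# Hedged sketch of vocabulary pruning for protein sequences.
# Assumption: the model's vocabulary is a {token: id} dict, and only
# special tokens and tokens composed of the 20 canonical amino-acid
# letters are useful for protein data.

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
SPECIAL_TOKENS = {"<s>", "</s>", "<unk>", "<pad>"}

def prune_vocab(vocab):
    """Keep special tokens and tokens spelled entirely from amino-acid
    letters; re-index survivors into a compact, contiguous id space."""
    kept = [
        tok for tok in vocab
        if tok in SPECIAL_TOKENS or (tok and set(tok) <= AMINO_ACIDS)
    ]
    return {tok: new_id for new_id, tok in enumerate(sorted(kept))}

# Toy vocabulary mixing protein-like and natural-language subwords.
toy_vocab = {"<s>": 0, "</s>": 1, "MK": 2, "the": 3, "LLVA": 4, "##ing": 5, "G": 6}
pruned = prune_vocab(toy_vocab)
```

A smaller vocabulary shrinks the embedding and output-projection matrices, which is one way pruning can reduce training cost.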
Problem

Research questions and friction points this paper is trying to address.

Fragmentation of PLMs: no single model handles both protein understanding and generation
Bridging the gap between protein language understanding (PLU) and generation (PLG) tasks
Generating biologically viable protein sequences with controllable functions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training framework that transforms a general-purpose LLM into a multi-task PLM
Two-stage training approach for protein domain expertise
Protein Vocabulary Pruning (PVP) for improved training efficiency
Multi-task instruction dataset of 13 million samples with superfamily information
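The multi-task instruction dataset pairs instructions with outputs for both generation and understanding tasks. A minimal illustrative sketch; the field names, prompt wording, superfamily label, and the truncated sequence placeholder are all hypothetical, not ProLLaMA's actual data format:

```python
# Illustrative sketch of multi-task instruction samples covering a
# generation (PLG) task and an understanding (PLU) task, as described
# in the abstract. All field names, prompts, and values are hypothetical.

def make_generation_sample(superfamily, sequence):
    """Controllable-generation sample: superfamily in, sequence out."""
    return {
        "instruction": f"Generate a protein sequence from the {superfamily} superfamily.",
        "output": sequence,
    }

def make_understanding_sample(sequence, superfamily):
    """Superfamily-prediction sample: sequence in, superfamily out."""
    return {
        "instruction": f"Predict the superfamily of this protein: {sequence}",
        "output": superfamily,
    }

samples = [
    make_generation_sample("Trypsin-like serine protease", "MKTLLVAG..."),
    make_understanding_sample("MKTLLVAG...", "Trypsin-like serine protease"),
]
```

Framing both directions as instruction–output pairs is what lets one model be co-trained on PLU and PLG with a single objective.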
πŸ”Ž Similar Papers
No similar papers found.
Liuzhenghao Lv
PhD student, Computer Science, Peking University
Large Language Models · AI for Science · Spiking Neural Networks
Zongying Lin
School of Electronic and Computer Engineering, Peking University
Hao Li
School of Electronic and Computer Engineering, Peking University
Yuyang Liu
School of Electronic and Computer Engineering, Peking University
Jiaxi Cui
School of Electronic and Computer Engineering, Peking University
Calvin Yu-Chian Chen
School of Electronic and Computer Engineering, Peking University
Li Yuan
Research Associate, University of Science & Technology of China (USTC)
Antibiotic resistance · Wastewater treatment · Environmental bioremediation · Anaerobic digestion · Fate of organic pollutants
Yonghong Tian
School of Electronic and Computer Engineering, Peking University