Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional protein prediction tasks rely on task-specific models, resulting in poor generalizability and high computational overhead. To address this, we propose Prot2Token, a unified protein language modeling framework. It reformulates diverse tasks—including sequence-level property prediction, residue-level modeling, and protein–protein interaction prediction—into a single “task-prompted next-token prediction” paradigm. We introduce a standardized, biology-aware tokenization scheme for protein prediction targets and design learnable task tokens alongside a spatially aware auxiliary decoder pre-training strategy. Prot2Token employs an autoregressive decoder architecture, integrating pre-trained encoder embeddings with a multi-task prompting mechanism. On multiple benchmarks it matches or surpasses specialized models, notably improving accuracy on structure-sensitive tasks. Moreover, inference is nearly 1,000× faster than AlphaFold2 (including MSA generation), yielding strong efficiency and cross-task generalization.

📝 Abstract
The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling a single model to master numerous tasks with improved efficiency. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's strong predictive power in different types of protein-prediction tasks. Key results include significant speedups (e.g., nearly 1000× over AlphaFold2 with MSA) and performance often matching or exceeding specialized approaches. Beyond that, we introduce an auxiliary self-supervised decoder pre-training approach to improve performance on spatially sensitive tasks. Prot2Token thus offers a significant step towards a versatile, high-throughput paradigm for protein modeling, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token.
Problem

Research questions and friction points this paper is trying to address.

Unifying diverse protein prediction tasks into a single framework
Overcoming inefficiency of specialized Protein Language Models (PLMs)
Enabling multi-task learning for improved computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified protein modeling via next-token prediction
Autoregressive decoder with task-specific embeddings
Self-supervised pre-training for spatial sensitivity
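The innovations above center on one mechanism: an autoregressive decoder, conditioned on pre-trained encoder embeddings and primed by a learnable task token, emits the prediction one token at a time. The toy sketch below illustrates only the input/output format of that "task-prompted next-token prediction" loop; the task-token name, the stub decoder step, and the emitted label are illustrative assumptions, not the paper's actual vocabulary, architecture, or weights.

```python
# Toy illustration of task-prompted next-token prediction (format only).
# The task token, decoder stub, and label sequence are hypothetical.

TASK_TOKEN = "<task:fluorescence>"  # hypothetical learnable task token
EOS = "<eos>"

def toy_decoder_step(encoder_embedding, generated):
    """Stand-in for one autoregressive decoder step: return the next
    token given the (mocked) encoder embedding and tokens so far."""
    # A real decoder would cross-attend to encoder_embedding; here we
    # replay a fixed label sequence just to show the output format.
    script = ["3", ".", "2", EOS]
    return script[len(generated) - 1]  # index 0 is the task-token prompt

def predict(sequence, task_token, max_len=8):
    encoder_embedding = [0.0] * 4      # mocked pre-trained PLM embedding
    generated = [task_token]           # decoding is primed by the task token
    while len(generated) < max_len:
        nxt = toy_decoder_step(encoder_embedding, generated)
        if nxt == EOS:
            break
        generated.append(nxt)
    return "".join(generated[1:])      # drop the prompt, keep the answer

print(predict("MKTAYIAK", TASK_TOKEN))  # a regression label rendered as tokens
```

Swapping the task token (and the decoder's learned behavior) is what lets one model serve sequence-level, residue-level, and interaction tasks without task-specific heads.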
Mahdi Pourmirzaei
University of Missouri
Farzaneh Esmaili
University of Missouri
Salhuldin Alqarghuli
University of Missouri
Mohammadreza Pourmirzaei
Politecnico di Milano, Milan, Italy
Ye Han
Doctoral Candidate, Tongji University
Kai Chen
University of Missouri
Mohsen Rezaei
University of Missouri
Duolin Wang
Research Scientist of Bioinformatics, University of Missouri
Dong Xu
University of Missouri