ProtTeX: Structure-In-Context Reasoning and Editing of Proteins with Large Language Models

📅 2025-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Protein large language models (LLMs) have long been constrained by sequence-only inputs, limiting their capacity to capture deep structure–function relationships. This work introduces the first structure-aware multimodal protein tokenization framework that jointly maps amino acid sequences, 3D atomic coordinates, and textual descriptions into a shared discrete token space—enabling standard text-based LLMs to natively understand, reason about, and generate protein structures. Our approach comprises: (i) structure-guided discrete tokenization; (ii) a unified next-token prediction training paradigm; (iii) RMSD-aware text–structure alignment encoding; and (iv) LLM-based end-to-end structural editing. Experiments demonstrate a twofold improvement in functional prediction accuracy over domain-specific SOTA methods; high-fidelity backbone conformation generation; and residue-level programmable design. Notably, we are the first to empirically validate that off-the-shelf decoder-only LLMs—without architectural modification—can jointly solve diverse tasks including structural understanding, functional prediction, conformation generation, and rational design.

📝 Abstract
Large language models (LLMs) have made remarkable progress in molecular science, particularly in understanding and generating functional small molecules. This success is largely attributed to effective molecular tokenization strategies. In protein science, however, the amino acid sequence serves as the sole token stream for LLMs, even though many fundamental challenges in the field are inherently structure-dependent. The absence of structure-aware tokens significantly limits the capability of LLMs for comprehensive biomolecular comprehension and multimodal generation. To address these challenges, we introduce ProtTeX, a novel framework that tokenizes protein sequences, structures, and textual information into a unified discrete space. This approach enables joint training of the LLM exclusively through the next-token prediction paradigm, facilitating multimodal protein reasoning and generation. ProtTeX enables general LLMs to perceive and process protein structures through sequential text input, leverage structural information as intermediate reasoning components, and generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state-of-the-art domain expert model with a twofold increase in accuracy. Our framework also enables high-quality conformational generation and customizable protein design. For the first time, we demonstrate that by adopting standard training and inference pipelines from the LLM domain, ProtTeX empowers decoder-only LLMs to effectively address a diverse spectrum of protein-related tasks.
Problem

Research questions and friction points this paper is trying to address.

Addresses structure-dependent challenges in protein science using LLMs.
Introduces ProtTeX for unified tokenization of protein sequences and structures.
Enhances protein function prediction and customizable protein design.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified tokenization of protein sequences and structures
Joint training via Next-Token Prediction paradigm
Multimodal protein reasoning and generation capabilities
Zicheng Ma
Peking University
Biophysics · Bioinformatics · Deep Learning
Chuanliu Fan
School of Computer Science and Technology, Soochow University, Suzhou, China
Zhicong Wang
School of Computer Science and Technology, Soochow University, Suzhou, China
Zhenyu Chen
Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China.
Xiaohan Lin
Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China.
Yanheng Li
City University of Hong Kong
Human-Computer Interaction · Human-Robot Interaction
Shihao Feng
Changping Laboratory, Beijing 102200, China.
Jun Zhang
Changping Laboratory, Beijing 102200, China.
Ziqiang Cao
Soochow University
Natural Language Processing
Y. Gao
Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China.