HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

📅 2025-12-17
🤖 AI Summary
To address the loss of fine-grained structural information caused by discretizing continuous 3D coordinates in joint protein sequence–structure modeling, this paper introduces HD-Prot, presented as the first hybrid diffusion protein language model to co-train discrete sequence tokens with continuous structure tokens. Methodologically, HD-Prot operates within a unified absorbing diffusion framework that directly models continuous structural latents, eliminating the reliance on vector quantization, and estimates per-token distributions in a single architecture: categorical prediction for sequences and continuous diffusion for structures. Empirically, HD-Prot performs on par with state-of-the-art multimodal protein language models across unconditional sequence–structure co-generation, motif scaffolding, protein structure prediction, and inverse folding. Notably, it attains these results while requiring substantially less training compute than prior diffusion- or VQ-based approaches.

📝 Abstract
Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive empirical results show that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks, performing on par with state-of-the-art multimodal pLMs despite being developed under limited computational resources. It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.
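The abstract's central mechanism, a single absorbing diffusion process corrupting both modalities, with masked discrete tokens sent to an absorbing state and continuous structure latents noised instead of quantized, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `MASK_ID` absorbing token, the Gaussian corruption of latents at masked positions, and the `absorb_hybrid` helper are all assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 20  # hypothetical absorbing-state id appended after the 20 amino-acid tokens

def absorb_hybrid(seq_tokens, struct_latents, t):
    """Hybrid absorbing-diffusion corruption at noise level t in [0, 1].

    Each residue position is independently absorbed with probability t.
    At absorbed positions, the discrete sequence token is replaced by
    MASK_ID, while the continuous structure latent is replaced by Gaussian
    noise (a continuous analogue of the absorbing state, avoiding any
    vector quantization of the structure modality).
    """
    n = len(seq_tokens)
    absorbed = rng.random(n) < t                      # shared mask across modalities
    noisy_seq = np.where(absorbed, MASK_ID, seq_tokens)
    noise = rng.standard_normal(struct_latents.shape)
    noisy_struct = np.where(absorbed[:, None], noise, struct_latents)
    return noisy_seq, noisy_struct, absorbed
```

A model trained on such corruptions would then recover the per-token distributions exactly as the abstract describes: a categorical head predicts the original amino acid at masked positions, while a continuous diffusion head denoises the structure latents, so both losses flow through one shared backbone.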
Problem

Research questions and friction points this paper is trying to address.

Integrating continuous protein structure data into discrete language models.
Avoiding information loss from discretizing structures in multimodal protein models.
Enabling joint sequence-structure modeling with both discrete and continuous tokens.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses continuous structure tokens to avoid discretization loss.
Combines discrete and continuous tokens via hybrid diffusion model.
Unifies sequence and structure modeling with absorbing diffusion process.
👥 Authors
Yi Zhou, The Hong Kong Polytechnic University
Haohao Qu, The Hong Kong Polytechnic University
Yunqing Liu, PhD Candidate, The Hong Kong Polytechnic University (PolyU)
Shanru Lin, The Hong Kong Polytechnic University
Le Song, CTO, GenBio AI; Professor, MBZUAI
Wenqi Fan, The Hong Kong Polytechnic University