CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models

📅 2025-05-28
🤖 AI Summary
Current protein language models (PLMs) struggle to jointly satisfy constraints from multiple modalities and granularities, including functional annotations (GO/EC/IPR), sequence features, and 3D structural properties. To address this, this work proposes CFP-Gen, a diffusion language model that supports joint guidance across these modalities for de novo protein design. The approach combines Annotation-Guided Feature Modulation (AGFM) with Residue-Controlled Functional Encoding (RCFE) and can incorporate off-the-shelf 3D structure encoders, enabling composable, controllable generation of multifunctional proteins. Experiments show high-throughput generation of novel proteins with functionality comparable to natural proteins, along with a high success rate in multifunctional design. By unifying functional, sequence, and structural constraints within a single generative model, the method offers a basis for multi-constrained protein engineering.

📝 Abstract
Existing PLMs generate protein sequences based on a single-condition constraint from a specific modality, struggling to simultaneously satisfy multiple constraints across different modalities. In this work, we introduce CFP-Gen, a novel diffusion language model for Combinatorial Functional Protein GENeration. CFP-Gen facilitates de novo protein design by integrating multimodal conditions with functional, sequence, and structural constraints. Specifically, an Annotation-Guided Feature Modulation (AGFM) module is introduced to dynamically adjust the protein feature distribution based on composable functional annotations, e.g., GO terms, IPR domains, and EC numbers. Meanwhile, the Residue-Controlled Functional Encoding (RCFE) module captures residue-wise interactions to ensure more precise control. Additionally, off-the-shelf 3D structure encoders can be seamlessly integrated to impose geometric constraints. We demonstrate that CFP-Gen enables high-throughput generation of novel proteins with functionality comparable to natural proteins, while achieving a high success rate in designing multifunctional proteins. Code and data are available at https://github.com/yinjunbo/cfpgen.

Problem

Research questions and friction points this paper is trying to address.

Generating proteins that satisfy multiple constraints across modalities at once
Integrating functional, sequence, and structural constraints in a single model
Designing multifunctional proteins at high throughput

Innovation

Methods, ideas, or system contributions that make the work stand out.

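Diffusion language model integrates multimodal conditions with functional, sequence, and structural constraints
AGFM module dynamically adjusts the protein feature distribution based on composable functional annotations (GO terms, IPR domains, EC numbers)
RCFE module captures residue-wise interactions for more precise control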
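
As a rough illustration of what annotation-conditioned feature modulation could look like, the sketch below sums the embeddings of several annotations into one condition vector and uses it to scale and shift per-residue features. This is not the CFP-Gen implementation: the scale-and-shift form, the summation used to compose annotations, the tiny feature width, and all names (`compose_annotations`, `modulate`, the placeholder labels) are assumptions made only for this example.

```python
import random

DIM = 4  # hypothetical feature width, kept tiny for illustration


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def compose_annotations(table, labels, dim=DIM):
    """Sum the embeddings of the selected annotations (assumed composition rule)."""
    cond = [0.0] * dim
    for label in labels:
        cond = [c + e for c, e in zip(cond, table[label])]
    return cond


def modulate(features, cond, w_scale, w_shift):
    """Per-channel modulation: h' = h * (1 + gamma) + beta."""
    gamma = matvec(w_scale, cond)
    beta = matvec(w_shift, cond)
    return [[h * (1.0 + g) + b for h, g, b in zip(row, gamma, beta)]
            for row in features]


rng = random.Random(0)
# Placeholder labels, not real GO/IPR/EC identifiers.
labels = ["GO_TERM_A", "IPR_DOMAIN_B", "EC_NUMBER_C"]
table = {label: [rng.gauss(0.0, 1.0) for _ in range(DIM)] for label in labels}
w_scale = random_matrix(DIM, DIM, rng)
w_shift = random_matrix(DIM, DIM, rng)

# Three residues, each with a DIM-wide feature vector.
features = random_matrix(3, DIM, rng)

unconditioned = modulate(features, compose_annotations(table, []), w_scale, w_shift)
conditioned = modulate(
    features, compose_annotations(table, ["GO_TERM_A", "EC_NUMBER_C"]), w_scale, w_shift
)
```

With this form, an empty annotation set yields a zero condition vector, so the features pass through unchanged, and adding or dropping annotations changes only the condition vector rather than the model's inputs.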
Junbo Yin
KAUST; Beijing Institute of Technology; EPFL
3D Vision, Multimodal Learning, Protein Design
Chao Zha
Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
Wenjia He
Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
Chencheng Xu
Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
Xin Gao
Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia