Natural Language Guided Ligand-Binding Protein Design

📅 2025-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses natural language-guided de novo design of ligand-binding proteins. Because high-quality protein-ligand complex structures are scarce while textual descriptions are abundant, we propose a language-ligand-protein ternary instruction-tuning paradigm and introduce InstructProBench, a large-scale instruction dataset comprising about 9.59 million samples. Leveraging a Transformer architecture, we develop InstructPro-1B and InstructPro-3B, models that jointly encode SMILES representations and natural language instructions via multimodal alignment training. InstructPro-1B achieves an 81.52% docking success rate at moderate confidence and an average RMSD of 4.026 Å against ground-truth structures, while InstructPro-3B further reduces the average RMSD to 2.527 Å. Both variants outperform ProGen2, ESM3, and Pinal. The work is presented as a framework for controllable, language-driven design of functional proteins.

📝 Abstract

Can AI protein models follow human language instructions and design proteins with desired functions (e.g., binding to a ligand)? Designing proteins that bind to a given ligand is crucial in a wide range of applications in biology and chemistry. Most prior AI models are trained on protein-ligand complex data, which is scarce due to the high cost and time requirements of laboratory experiments. In contrast, there is a substantial body of human-curated text describing protein-ligand interactions and ligand formulas. In this paper, we propose InstructPro, a family of protein generative models that follow natural language instructions to design ligand-binding proteins. Given a textual description of the desired function and a ligand formula in SMILES, InstructPro generates protein sequences that are functionally consistent with the specified instructions. We develop the model architecture, training strategy, and a large-scale dataset, InstructProBench, to support both training and evaluation. InstructProBench consists of 9,592,829 triples of (function description, ligand formula, protein sequence). We train two model variants: InstructPro-1B (with 1 billion parameters) and InstructPro-3B (with 3 billion parameters). Both variants consistently outperform strong baselines, including ProGen2, ESM3, and Pinal. Notably, InstructPro-1B achieves the highest docking success rate (81.52% at moderate confidence) and the lowest average root mean square deviation (RMSD) compared to ground truth structures (4.026 Å). InstructPro-3B further decreases the average RMSD to 2.527 Å, demonstrating InstructPro's ability to generate ligand-binding proteins that align with the functional specifications.
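For readers unfamiliar with the RMSD metric quoted above, the following is a minimal sketch of how a root mean square deviation between two matched coordinate sets can be computed and then averaged over several designs. It assumes the two structures are already superimposed and have atoms in the same order; the abstract does not describe the alignment or atom-selection procedure, so the function names `rmsd` and `mean_rmsd` and the toy coordinates are illustrative only.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in angstroms


def rmsd(predicted: Sequence[Point], reference: Sequence[Point]) -> float:
    """Root mean square deviation between two matched coordinate lists.

    Assumption: the structures are already superimposed and atom i in
    `predicted` corresponds to atom i in `reference`.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared_sum = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(predicted, reference)
    )
    return math.sqrt(squared_sum / len(predicted))


def mean_rmsd(values: Sequence[float]) -> float:
    """Average RMSD over a set of generated designs."""
    if not values:
        raise ValueError("need at least one RMSD value")
    return sum(values) / len(values)


# Toy example with two atoms; the second atom is displaced by 2 A along z.
example = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
               [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)])
```

Lower values mean the predicted structure lies closer to the reference, which is why the reduction from 4.026 Å to 2.527 Å is reported as an improvement.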

Problem

Research questions and friction points this paper is trying to address.

Designing proteins that bind to specific ligands using AI

Overcoming scarcity of protein-ligand complex data with text descriptions

Generating functional proteins from natural language instructions and SMILES

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI models follow language instructions for protein design

Generates proteins from text descriptions and ligand formulas

Large-scale dataset (InstructProBench) and two model variants that outperform baselines
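To make the (function description, ligand formula, protein sequence) triple concrete, here is a small sketch of how one such record might be represented and turned into a conditioning prompt. The class name `InstructionTriple`, the `build_prompt` template, and the example values are all hypothetical; the source does not specify the actual data schema or prompt format used by InstructPro.

```python
from dataclasses import dataclass

# The 20 standard amino acid one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass(frozen=True)
class InstructionTriple:
    """Hypothetical container for one dataset record."""
    function_description: str  # natural language instruction
    ligand_smiles: str         # ligand formula as a SMILES string
    protein_sequence: str      # target protein, one-letter amino acid codes

    def is_valid_sequence(self) -> bool:
        """True if the sequence is non-empty and uses only standard residues."""
        return bool(self.protein_sequence) and set(self.protein_sequence) <= STANDARD_AMINO_ACIDS


def build_prompt(triple: InstructionTriple) -> str:
    """Illustrative prompt template (an assumption, not the authors' format).

    The model would be asked to continue the text after 'Protein:'.
    """
    return (
        f"Instruction: {triple.function_description}\n"
        f"Ligand (SMILES): {triple.ligand_smiles}\n"
        "Protein:"
    )


# Placeholder record: "CCO" is the SMILES string for ethanol; the sequence
# is an arbitrary short string of residues, not a real designed protein.
record = InstructionTriple(
    function_description="Design a protein that binds ethanol.",
    ligand_smiles="CCO",
    protein_sequence="MKTAYIAK",
)
prompt = build_prompt(record)
```

Keeping the target sequence out of the prompt mirrors the setup described in the abstract, where the text description and SMILES string are inputs and the protein sequence is the generated output.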
