🤖 AI Summary
This study addresses the challenge of natural language–guided de novo design of ligand-binding proteins. To reconcile the scarcity of high-quality protein–ligand complex structural data with the abundance of textual descriptions, we propose a novel language–ligand–protein ternary instruction-tuning paradigm and introduce InstructProBench, a large-scale instruction dataset comprising 9.59 million samples. Leveraging a Transformer architecture, we develop InstructPro-1B and InstructPro-3B models that jointly encode SMILES representations and natural language instructions via multimodal alignment training. Experimental results show that InstructPro-1B achieves an 81.52% docking success rate at moderate confidence and an average RMSD of 4.026 Å relative to ground-truth structures, while InstructPro-3B further reduces the average RMSD to 2.527 Å; both variants consistently outperform ProGen2, ESM3, and Pinal. To our knowledge, this work represents the first framework enabling controllable, interpretable, and high-accuracy language-driven functional protein design.
📝 Abstract
Can AI protein models follow human language instructions and design proteins with desired functions (e.g., binding to a ligand)? Designing proteins that bind to a given ligand is crucial in a wide range of applications in biology and chemistry. Most prior AI models are trained on protein-ligand complex data, which is scarce due to the high cost and time requirements of laboratory experiments. In contrast, there is a substantial body of human-curated text descriptions of protein-ligand interactions and ligand formulas. In this paper, we propose InstructPro, a family of protein generative models that follow natural language instructions to design ligand-binding proteins. Given a textual description of the desired function and a ligand formula in SMILES, InstructPro generates protein sequences that are functionally consistent with the specified instructions. We develop the model architecture, training strategy, and a large-scale dataset, InstructProBench, to support both training and evaluation. InstructProBench consists of 9,592,829 triples of (function description, ligand formula, protein sequence). We train two model variants: InstructPro-1B (with 1 billion parameters) and InstructPro-3B (with 3 billion parameters). Both variants consistently outperform strong baselines, including ProGen2, ESM3, and Pinal. Notably, InstructPro-1B achieves the highest docking success rate (81.52% at moderate confidence) and the lowest average root mean square deviation (RMSD) compared to ground truth structures (4.026 Å). InstructPro-3B further decreases the average RMSD to 2.527 Å, demonstrating InstructPro's ability to generate ligand-binding proteins that align with the functional specifications.
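To make the two quantities in the abstract concrete, the following is a minimal sketch of what one (function description, ligand formula, protein sequence) triple might look like and how an RMSD between two matched coordinate sets can be computed. The class name `InstructionTriple`, its field names, the example values, and the `rmsd` helper are all hypothetical and are not taken from the paper. The sketch also assumes the two structures are already superimposed and that atoms are paired one to one; the paper's actual evaluation pipeline may align structures and select atoms differently.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionTriple:
    """Hypothetical container for one dataset triple (field names are assumptions)."""
    function_description: str  # natural language description of the desired function
    ligand_smiles: str         # ligand formula written in SMILES
    protein_sequence: str      # amino-acid sequence, one letter per residue


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally sized lists of (x, y, z) points.

    Assumes the point sets are already superimposed and paired index by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Placeholder values for illustration only; not drawn from InstructProBench.
example = InstructionTriple(
    function_description="Binds ethanol.",
    ligand_smiles="CCO",
    protein_sequence="MKTAYIAKQR",
)

# Two toy structures whose atoms are each displaced by 1 Å along x.
predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
reference = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0)]
deviation = rmsd(predicted, reference)
```

Under these assumptions, a lower value means the generated structure sits closer to the reference, which is the sense in which 2.527 Å improves on 4.026 Å.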