Beyond Protein Language Models: An Agentic LLM Framework for Mechanistic Enzyme Design

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Generating mechanistic, testable scientific hypotheses in protein design remains slow, particularly for metalloproteins such as ferredoxins, because sequence–structure–function relationships are complex. Method: Genie-CAT is a tool-augmented, agent-based large language model (LLM) framework that integrates retrieval-augmented generation (RAG), structural parsing of Protein Data Bank files, electrostatic potential calculations, and machine-learning prediction of redox properties, coupling natural-language reasoning with data-driven and physics-based computation. Contribution/Results: Unlike conventional conversational LLMs, Genie-CAT operates as an autonomous scientific agent that constructs interpretable, experimentally testable mechanistic hypotheses. In proof-of-concept demonstrations on iron–sulfur proteins, it autonomously identified residue-level modifications near [Fe–S] clusters that affect redox tuning, reproducing expert-derived hypotheses in a fraction of the time.

📝 Abstract

We present Genie-CAT, a tool-augmented large-language-model (LLM) system designed to accelerate scientific hypothesis generation in protein design. Using metalloproteins (e.g., ferredoxins) as a case study, Genie-CAT integrates four capabilities into a unified agentic workflow: literature-grounded reasoning through retrieval-augmented generation (RAG), structural parsing of Protein Data Bank files, electrostatic potential calculations, and machine-learning prediction of redox properties. By coupling natural-language reasoning with data-driven and physics-based computation, the system generates mechanistically interpretable, testable hypotheses linking sequence, structure, and function. In proof-of-concept demonstrations, Genie-CAT autonomously identifies residue-level modifications near [Fe–S] clusters that affect redox tuning, reproducing expert-derived hypotheses in a fraction of the time. The framework highlights how AI agents combining language models with domain-specific tools can bridge symbolic reasoning and numerical simulation, transforming LLMs from conversational assistants into partners for computational discovery.
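To make the "unified agentic workflow" concrete, the following Python sketch shows one way four tools could sit behind a single agent loop. It is an illustration under stated assumptions, not Genie-CAT's implementation: the tool names, the `Agent` class, and the stubbed return strings are all hypothetical, and the tool sequence is passed in as a fixed plan, whereas in a real agent the language model would choose it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for the four capabilities named in the abstract.
# Each stub returns a placeholder string instead of doing real work.
def retrieve_literature(question: str) -> str:
    return f"[stub] passages retrieved for: {question}"

def parse_structure(question: str) -> str:
    return "[stub] residues near the [Fe-S] cluster"

def compute_electrostatics(question: str) -> str:
    return "[stub] electrostatic potential at the cluster"

def predict_redox(question: str) -> str:
    return "[stub] predicted redox property"

# Tool registry: the agent looks tools up by name.
TOOLS: Dict[str, Callable[[str], str]] = {
    "literature": retrieve_literature,
    "structure": parse_structure,
    "electrostatics": compute_electrostatics,
    "redox": predict_redox,
}

@dataclass
class Agent:
    tools: Dict[str, Callable[[str], str]]
    trace: List[Tuple[str, str]] = field(default_factory=list)

    def run(self, question: str, plan: List[str]) -> str:
        """Call each tool in `plan` and collect the outputs as evidence."""
        self.trace = []  # start a fresh trace for every question
        for name in plan:
            if name not in self.tools:
                raise KeyError(f"unknown tool: {name}")
            self.trace.append((name, self.tools[name](question)))
        evidence = "\n".join(f"- {name}: {out}" for name, out in self.trace)
        return f"Question: {question}\nEvidence:\n{evidence}"

if __name__ == "__main__":
    agent = Agent(tools=TOOLS)
    plan = ["literature", "structure", "electrostatics", "redox"]
    print(agent.run("Which residues tune the cluster redox potential?", plan))
```

Keeping the per-call trace separate from the final report means every statement in a generated hypothesis can be traced back to the tool output that supports it, which is one plausible route to the interpretability the abstract emphasizes.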

Problem

Research questions and friction points this paper is trying to address.

Accelerating scientific hypothesis generation for protein design

Generating mechanistically interpretable links between sequence, structure, and function

Autonomously identifying residue-level modifications affecting enzyme redox properties

Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates RAG with PDB structural parsing, electrostatic potential calculations, and machine-learning redox prediction

Combines language reasoning with physics-based computation

Autonomously identifies residue modifications for redox tuning
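As a rough illustration of what pairing structural information with a physics-based calculation can look like, the sketch below selects residues within a cutoff distance of a cluster and sums a simple Coulomb potential at the cluster position. Everything here is an assumption made for illustration: the summary does not say how Genie-CAT computes electrostatic potentials, and the residue names, coordinates, charges, cutoff, and dielectric value are invented toy inputs rather than data from the paper.

```python
import math
from typing import List, Sequence, Tuple

# (residue label, side-chain charge position in angstroms, net charge in e)
Residue = Tuple[str, Sequence[float], float]

# Approximate Coulomb constant in kcal*angstrom/(mol*e^2), as commonly
# used in molecular mechanics unit systems.
COULOMB_CONSTANT = 332.06

def residues_within(cluster_xyz: Sequence[float],
                    residues: List[Residue],
                    cutoff: float) -> List[str]:
    """Return labels of residues no farther than `cutoff` from the cluster."""
    return [name for name, xyz, _ in residues
            if math.dist(cluster_xyz, xyz) <= cutoff]

def coulomb_potential(cluster_xyz: Sequence[float],
                      residues: List[Residue],
                      dielectric: float = 4.0) -> float:
    """Sum point-charge Coulomb terms at the cluster position.

    A uniform dielectric is a strong simplification; it is assumed here
    only to keep the example short.
    """
    total = 0.0
    for _, xyz, charge in residues:
        r = math.dist(cluster_xyz, xyz)
        total += COULOMB_CONSTANT * charge / (dielectric * r)
    return total

if __name__ == "__main__":
    cluster = (0.0, 0.0, 0.0)
    toy_residues: List[Residue] = [
        ("LYS12", (3.0, 0.0, 0.0), +1.0),   # hypothetical nearby cation
        ("ASP40", (0.0, 4.0, 0.0), -1.0),   # hypothetical nearby anion
        ("ALA7", (20.0, 0.0, 0.0), 0.0),    # neutral and far away
    ]
    print(residues_within(cluster, toy_residues, cutoff=5.0))
    print(coulomb_potential(cluster, toy_residues))
```

In this toy picture, swapping a nearby residue for one with a different charge changes the summed potential at the cluster, which is the kind of residue-level effect a redox-tuning hypothesis would then need to test experimentally.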

🔎 Similar Papers

InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions