🤖 AI Summary
Accelerating the generation of mechanistic, testable scientific hypotheses in protein design—particularly for metalloproteins such as ferredoxins—remains a major challenge due to the complexity of sequence–structure–function relationships.
Method: We introduce Genie-CAT, a tool-augmented, agent-based large language model (LLM) framework that integrates retrieval-augmented generation (RAG), structural parsing of Protein Data Bank files, electrostatic potential computation, and machine learning–based prediction of redox properties, coupling natural-language reasoning with data-driven and physics-based computation.
Contribution/Results: Unlike conventional conversational LLMs, Genie-CAT operates as an autonomous scientific agent that constructs interpretable, experimentally testable mechanistic hypotheses. In proof-of-concept demonstrations on iron–sulfur proteins, it reproduced expert-derived hypotheses in a fraction of the time, autonomously identifying residue-level modifications near [Fe–S] clusters that affect redox tuning.
📝 Abstract
We present Genie-CAT, a tool-augmented large-language-model (LLM) system designed to accelerate scientific hypothesis generation in protein design. Using metalloproteins (e.g., ferredoxins) as a case study, Genie-CAT integrates four capabilities into a unified agentic workflow: literature-grounded reasoning through retrieval-augmented generation (RAG), structural parsing of Protein Data Bank files, electrostatic potential calculations, and machine-learning prediction of redox properties. By coupling natural-language reasoning with data-driven and physics-based computation, the system generates mechanistically interpretable, testable hypotheses linking sequence, structure, and function. In proof-of-concept demonstrations, Genie-CAT autonomously identifies residue-level modifications near [Fe–S] clusters that affect redox tuning, reproducing expert-derived hypotheses in a fraction of the time. The framework highlights how AI agents combining language models with domain-specific tools can bridge symbolic reasoning and numerical simulation, transforming LLMs from conversational assistants into partners for computational discovery.
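The abstract does not say how the structural parsing and electrostatic steps are implemented. As a rough illustration only, the sketch below assumes the simplest possible model: it reads fixed-column ATOM/HETATM records from PDB text and sums a point-charge Coulomb potential at a chosen site, such as the centre of an [Fe–S] cluster. The function names, the `FORMAL_CHARGES` table, and the uniform dielectric of 4.0 are hypothetical choices made for this example and are not taken from Genie-CAT.

```python
import math

# Hypothetical simplification: one formal charge per charged side chain,
# placed on a single representative atom. A real calculation would use
# force-field partial charges and a continuum solvent model.
FORMAL_CHARGES = {
    ("ASP", "CG"): -1.0,
    ("GLU", "CD"): -1.0,
    ("LYS", "NZ"): +1.0,
    ("ARG", "CZ"): +1.0,
}

# Coulomb constant in kcal*Angstrom/(mol*e^2).
COULOMB_KCAL = 332.06


def parse_pdb_atoms(pdb_text):
    """Parse ATOM/HETATM records using the fixed-column PDB layout."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),      # columns 13-16
                "resname": line[17:20].strip(),   # columns 18-20
                "resseq": int(line[22:26]),       # columns 23-26
                "xyz": (
                    float(line[30:38]),           # columns 31-38
                    float(line[38:46]),           # columns 39-46
                    float(line[46:54]),           # columns 47-54
                ),
            })
    return atoms


def coulomb_potential(atoms, site, dielectric=4.0):
    """Sum k*q/(eps*r) over charged atoms; result in kcal/(mol*e)."""
    total = 0.0
    for atom in atoms:
        charge = FORMAL_CHARGES.get((atom["resname"], atom["name"]))
        if charge is None:
            continue  # atom carries no formal charge in this crude model
        distance = math.dist(atom["xyz"], site)
        if distance > 0.0:  # skip an atom sitting exactly on the site
            total += COULOMB_KCAL * charge / (dielectric * distance)
    return total
```

Calling `coulomb_potential(parse_pdb_atoms(text), site)` for a wild-type structure and for a mutant with one charged residue changed gives a first-order estimate of how that residue shifts the potential at the cluster. That kind of residue-level comparison is what the abstract describes, though the actual system presumably relies on more rigorous electrostatics.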