🤖 AI Summary
This work proposes the first end-to-end, genome-driven framework for predicting multidimensional microbial physiological boundaries such as temperature, pH, and salinity. It targets two limitations of prior work: traditional approaches rely heavily on labor-intensive in vitro assays, and existing computational models do not effectively bridge genotype and physiological phenotype. The framework pairs a genome-conditioned large language model agent, which consumes LucaOne genomic embeddings, with retrieval-augmented generation (RAG) and genome-scale metabolic models (GEMs), and adds a counterfactual gene-anchored reward mechanism and a dynamic tool-calling strategy. Trained through a three-stage pipeline (gene–text alignment, supervised fine-tuning, and GRPO), the resulting 4B-parameter agent matches or exceeds the performance of significantly larger models across multiple tasks, with ablation studies confirming the contribution of each component.
📝 Abstract
Characterizing the physiological life boundaries of microbial strains, including viable temperature, pH, salinity, substrate utilization, and morphology, is central to biotechnology and ecology, yet traditionally requires exhaustive in vitro screening. Existing computational approaches either treat physiological traits as isolated supervised targets or repurpose biological foundation models as static encoders, leaving the genotype-to-physiology gap largely unbridged. We formulate microbial life-boundary prediction as a unified genome-to-physiology task and address it with a genome-conditioned, tool-augmented LLM agent. To support this task, we curate a strain-centric benchmark from IJSEM, NCBI, and BacDive covering 1,525 strains and 6,448 instances across viability intervals, environmental optima, substrate utilization, categorical traits, and morphology. Architecturally, the agent injects frozen LucaOne genome embeddings into a Qwen backbone via lightweight token fusion, and reasons over a similarity-based RAG module and a Genome-scale Metabolic Model (GEM) perturbation tool. We optimize the agent through a three-stage pipeline of gene-text alignment, agentic SFT on distilled trajectories, and GRPO with a novel counterfactual gene-grounding reward that reinforces the policy only when the authentic genome embedding causally improves correct-token generation relative to a zero-gene ablation. The resulting 4B-parameter agent matches or surpasses substantially larger frontier LLMs, with ablations confirming that genome-token fusion, dynamic tool use, and the counterfactual reward each yield distinct, significant gains.
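The counterfactual gene-grounding reward described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function names, the `margin` and `bonus` parameters, the use of mean per-token log-probability as the comparison, and the standard group-normalized advantage commonly used with GRPO. It is not the authors' implementation.

```python
from statistics import mean, pstdev


def counterfactual_gene_reward(logp_with_gene, logp_zero_gene, is_correct,
                               margin=0.0, bonus=1.0):
    """Hypothetical reward: pay `bonus` only when the answer is correct AND
    the real genome embedding raises the mean log-probability of the correct
    tokens compared with a zero-gene ablation (embedding replaced by zeros).

    logp_with_gene / logp_zero_gene: per-token log-probs of the correct
    answer tokens under the two conditions (same length, same tokens).
    """
    if not is_correct:
        return 0.0
    gain = mean(logp_with_gene) - mean(logp_zero_gene)
    return bonus if gain > margin else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three illustrative rollouts for one prompt (made-up numbers).
rewards = [
    # Correct, and the genome helps: reward is granted.
    counterfactual_gene_reward([-0.2, -0.1], [-0.9, -0.8], is_correct=True),
    # Correct, but the genome does not help: no reward.
    counterfactual_gene_reward([-0.5, -0.5], [-0.4, -0.4], is_correct=True),
    # Incorrect answer: no reward regardless of the gain.
    counterfactual_gene_reward([-0.1, -0.1], [-2.0, -2.0], is_correct=False),
]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, only the first rollout receives a positive advantage, so the policy update favors answers whose correctness actually depends on the genome embedding rather than on text priors alone.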