GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes the first end-to-end, genome-driven framework for predicting multidimensional microbial physiological boundaries such as temperature, pH, and salinity. It targets two limitations of prior work: traditional approaches rely heavily on labor-intensive in vitro assays, and existing computational models fail to bridge genotype and physiological phenotype. The framework pairs a genome-informed large language model agent with LucaOne genomic embeddings, retrieval-augmented generation (RAG), and genome-scale metabolic models (GEMs), and adds a counterfactual gene-anchored reward mechanism and a dynamic tool-calling strategy. Trained through a three-stage pipeline (gene-text alignment, supervised fine-tuning, and GRPO optimization), the resulting 4B-parameter agent matches or exceeds the performance of significantly larger models across multiple tasks, and ablation studies confirm the contribution of each component.

📝 Abstract

Characterizing the physiological life boundaries of microbial strains, including viable temperature, pH, salinity, substrate utilization, and morphology, is central to biotechnology and ecology, yet traditionally requires exhaustive in vitro screening. Existing computational approaches either treat physiological traits as isolated supervised targets or repurpose biological foundation models as static encoders, leaving the genotype-to-physiology gap largely unbridged. We formulate microbial life-boundary prediction as a unified genome-to-physiology task and address it with a genome-conditioned, tool-augmented LLM agent. To support this task, we curate a strain-centric benchmark from IJSEM, NCBI, and BacDive covering 1,525 strains and 6,448 instances across viability intervals, environmental optima, substrate utilization, categorical traits, and morphology. Architecturally, the agent injects frozen LucaOne genome embeddings into a Qwen backbone via lightweight token fusion, and reasons over a similarity-based RAG module and a Genome-scale Metabolic Model (GEM) perturbation tool. We optimize the agent through a three-stage pipeline of gene-text alignment, agentic SFT on distilled trajectories, and GRPO with a novel counterfactual gene-grounding reward that reinforces the policy only when the authentic genome embedding causally improves correct-token generation relative to a zero-gene ablation. The resulting 4B-parameter agent matches or surpasses substantially larger frontier LLMs, with ablations confirming that genome-token fusion, dynamic tool use, and the counterfactual reward each yield distinct, significant gains.
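The counterfactual gene-grounding reward described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the `margin` and `bonus` parameters, and the use of a mean per-token log-probability gap are all assumptions made for illustration. The only idea taken from the abstract is that the reward fires only when the real genome embedding improves correct-token generation relative to a zero-gene ablation.

```python
from typing import Sequence


def counterfactual_gene_reward(
    logp_real: Sequence[float],
    logp_zero: Sequence[float],
    answer_correct: bool,
    margin: float = 0.0,  # hypothetical threshold, not from the paper
    bonus: float = 1.0,   # hypothetical reward size, not from the paper
) -> float:
    """Return a bonus only if the real genome embedding helps.

    logp_real: log-probabilities of the correct answer tokens when the
        model is conditioned on the authentic genome embedding.
    logp_zero: log-probabilities of the same tokens when the genome
        embedding is replaced by zeros (the zero-gene ablation).
    """
    if len(logp_real) != len(logp_zero) or not logp_real:
        raise ValueError("need two non-empty sequences of equal length")

    # No reward for an incorrect answer, regardless of the genome's effect.
    if not answer_correct:
        return 0.0

    # Mean per-token gain in log-probability attributable to the genome.
    gain = sum(r - z for r, z in zip(logp_real, logp_zero)) / len(logp_real)

    # Reinforce only when the genome embedding improves the correct tokens.
    return bonus if gain > margin else 0.0


# The genome raises the correct-token log-probs, so the bonus is granted.
print(counterfactual_gene_reward([-0.2, -0.1], [-1.0, -0.9], True))
```

In a GRPO setup, a term like this would presumably be combined with the task reward before group-relative normalization; how the paper weights or combines the two is not stated in the abstract.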

Problem

Research questions and friction points this paper is trying to address.

microbial life-boundary prediction

genome-to-physiology mapping

physiological traits

genotype-phenotype gap

substrate utilization

Innovation

Methods, ideas, or system contributions that make the work stand out.

genome-conditioned LLM

life-boundary prediction

tool-augmented agent