🤖 AI Summary
This work addresses the challenge of constructing a “virtual cell”: a computational system that leverages large language models (LLMs) to represent, predict, and reason about cellular states and behaviors. It covers three core tasks: cellular representation learning, perturbation response prediction, and gene regulation inference. We propose a unified taxonomy distinguishing the *predictive* (LLMs as Oracles) and *agentic* (LLMs as Agents) paradigms, and systematically survey the models, datasets, and evaluation benchmarks associated with each task. To tackle critical limitations in scalability, cross-condition generalization, and mechanistic interpretability, we examine how biological priors, scientific task orchestration, and multi-step reasoning techniques can be integrated into LLM-based modeling. Our contribution is threefold: (1) a foundational conceptual framework for virtual cell research; (2) a practical methodology guide grounded in domain-aware LLM design; and (3) a forward-looking roadmap for advancing LLM-driven computational biology. Together, these aim to accelerate the shift toward biologically grounded, reasoning-capable AI in systems biology.
📝 Abstract
Large language models (LLMs) are transforming cellular biology by enabling the development of "virtual cells"--computational systems that represent, predict, and reason about cellular states and behaviors. This work provides a comprehensive review of LLMs for virtual cell modeling. We propose a unified taxonomy that organizes existing methods into two paradigms: LLMs as Oracles, for direct cellular modeling, and LLMs as Agents, for orchestrating complex scientific tasks. We identify three core tasks--cellular representation, perturbation prediction, and gene regulation inference--and for each review the associated models, datasets, and evaluation benchmarks, as well as the critical challenges in scalability, generalizability, and interpretability.
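To make the Oracle/Agent distinction concrete, the sketch below contrasts the two paradigms on two of the core tasks: an Oracle-style single-shot perturbation prediction and an Agent-style multi-step loop for gene regulation inference. This is a minimal illustration, not the paper's method: the `CellState` type, the prompt formats, and the `llm` and `tools` callables are hypothetical stand-ins for a real model and toolset.

```python
from dataclasses import dataclass
from typing import Callable

# --- LLMs as Oracles: the model directly maps cellular inputs to predictions. ---

@dataclass
class CellState:
    """A toy cellular representation: gene names mapped to expression values."""
    expression: dict[str, float]

def oracle_predict_perturbation(llm: Callable[[str], str],
                                cell: CellState,
                                perturbation: str) -> str:
    """Single-shot prediction: serialize the cell state into a prompt and ask
    the model for the post-perturbation expression profile (hypothetical format)."""
    genes = ", ".join(f"{g}={v:.2f}" for g, v in cell.expression.items())
    prompt = (f"Cell expression profile: {genes}. "
              f"Predict the expression changes after perturbation: {perturbation}.")
    return llm(prompt)

# --- LLMs as Agents: the model orchestrates tools over multiple reasoning steps. ---

def agent_infer_regulation(llm: Callable[[str], str],
                           tools: dict[str, Callable[[str], str]],
                           question: str,
                           max_steps: int = 5) -> str:
    """Multi-step loop: the model picks a tool (e.g. a dataset query or a
    pathway lookup), observes the result, and iterates until it answers."""
    transcript = question
    for _ in range(max_steps):
        decision = llm(f"{transcript}\nNext action ('tool_name:query') or 'FINAL:answer'?")
        if decision.startswith("FINAL:"):
            return decision.removeprefix("FINAL:").strip()
        tool_name, _, query = decision.partition(":")
        observation = tools.get(tool_name.strip(), lambda q: "unknown tool")(query)
        transcript += f"\nAction: {decision}\nObservation: {observation}"
    return "no answer within step budget"
```

The design difference is the unit of work: the Oracle paradigm treats the LLM as a predictor over serialized cellular data, while the Agent paradigm treats it as a controller whose intermediate tool calls and observations form an auditable reasoning trace.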