LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) and agent-based models in single-cell biology suffer from fragmentation across data modalities, architectures, and evaluation criteria. Method: We systematically review 58 models and propose a unified classification framework encompassing six categories (foundation models, text-bridged models, spatial models, multimodal models, epigenomic models, and agent models) covering RNA, ATAC, multi-omics, and spatial modalities, and supporting eight core tasks including annotation and trajectory inference. We introduce the paradigm of "single-cell language-driven intelligence," establishing cross-dataset–model–evaluation linkages and defining ten domain-specific evaluation dimensions (e.g., biological interpretability, multi-omics alignment, fairness, and privacy preservation). Results: Drawing on 40+ public datasets and multimodal benchmarks, our analysis identifies critical challenges in interpretability, standardization, and trustworthy AI, providing the field with authoritative evaluation standards and a comprehensive technical roadmap.

📝 Abstract
Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into six families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and map them to eight key analytical tasks, including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical and scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides an integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.
Problem

Research questions and friction points this paper is trying to address.

Surveying language models for single-cell biology integration
Analyzing multimodal data and evaluation standards fragmentation
Addressing interpretability and standardization challenges in biological AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified survey of 58 single-cell foundation models
Categorizes methods into six model families
Evaluates models across ten domain dimensions
Sajib Acharjee Dip
Ph.D. Student, Computer Science, Virginia Tech
LLMs, Multimodal Learning, Bioinformatics
Adrika Zafor
Department of Computational Modeling and Data Analytics, Virginia Tech, Blacksburg, VA, USA
Bikash Kumar Paul
Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
Uddip Acharjee Shuvo
Software Engineer
Software Engineering, Deep Learning, Human-Computer Interaction
Muhit Islam Emon
Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
Xuan Wang
Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
Liqing Zhang
Professor @ Computer Science, Virginia Tech
Bioinformatics, Data Analytics, Machine Learning