scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery

📅 2026-02-12
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study addresses the lack of interpretable, auditable, and domain-informed automated reasoning methods in single-cell RNA sequencing analysis. The authors propose an “omics-native reasoning” paradigm and develop the first systematic framework in which a large language model directly inspects single-cell data and invokes bioinformatics tools within a natural language dialogue. Three core tasks (cell-type annotation, developmental-trajectory reconstruction, and transcription-factor target inference) are reformulated as stepwise reasoning problems that the model must solve, justify, and revise when new evidence arrives. Progress is measured on scBench, a suite of 9 curated datasets and graders released with the paper. Relative to one-shot prompting, iterative reasoning improves average cell-type annotation accuracy by 11% with o1 and reduces trajectory graph-edit distance by 30% with Gemini-2.5-Pro, while producing transparent reasoning traces that explain marker gene ambiguity and regulatory logic.

📝 Abstract
We present scPilot, the first systematic framework to practice omics-native reasoning: a large language model (LLM) converses in natural language while directly inspecting single-cell RNA-seq data and invoking bioinformatics tools on demand. scPilot converts core single-cell analyses, i.e., cell-type annotation, developmental-trajectory reconstruction, and transcription-factor targeting, into step-by-step reasoning problems that the model must solve, justify, and, when needed, revise with new evidence. To measure progress, we release scBench, a suite of 9 expertly curated datasets and graders that faithfully evaluate the omics-native reasoning capability of scPilot with respect to various LLMs. Experiments with o1 show that iterative omics-native reasoning lifts average accuracy by 11% for cell-type annotation, and Gemini-2.5-Pro cuts trajectory graph-edit distance by 30% versus one-shot prompting, while generating transparent reasoning traces that explain marker gene ambiguity and regulatory logic. By grounding LLMs in raw omics data, scPilot enables auditable, interpretable, and diagnostically informative single-cell analyses. Code, data, and package are available at https://github.com/maitrix-org/scPilot
Problem

Research questions and friction points this paper is trying to address.

single-cell RNA-seq
cell-type annotation
developmental trajectory
transcription-factor targeting
omics-native reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

omics-native reasoning
large language model
single-cell RNA-seq
automated biological analysis
interpretable AI
Yiming Gao
UC San Diego, Texas A&M
Zhen Wang
Postdoc at UCSD
Machine Learning, Large Language Models, Natural Language Processing
Jefferson Chen
UC San Diego
Mark Antkowiak
UC San Diego
Mengzhou Hu
UC San Diego
JungHo Kong
UC San Diego
Dexter Pratt
UC San Diego
NDEx, Cytoscape, Network Biology, Systems Biology
Jieyuan Liu
UC San Diego
Enze Ma
UC San Diego
Zhiting Hu
Assistant Professor at UC San Diego
Machine Learning, Artificial Intelligence, Natural Language Processing
Eric P. Xing
MBZUAI, CMU