SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation

πŸ“… 2026-02-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current evaluation methods for single-cell large language models are fragmented, disconnected from real-world applications, and lack biological grounding and interpretability. To address this, this work proposes SC-Arena, a natural language evaluation framework tailored to single-cell foundation models. SC-Arena unifies evaluation objectives through a virtual cell abstraction and introduces five core tasks, including cell type annotation and perturbation prediction, to assess models' biological reasoning capabilities. The framework integrates a knowledge-augmented evaluation mechanism that fuses ontologies, marker gene databases, and scientific literature to produce biologically plausible, interpretable, and discriminative judgments. Experiments reveal that existing models exhibit limited understanding of biological mechanisms and causality, and that SC-Arena's evaluation substantially outperforms conventional string-matching metrics in biological plausibility and interpretability.
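
To make the knowledge-augmented idea concrete, the sketch below illustrates how such an evaluator might grade a free-text cell type prediction against an ontology of synonyms and a marker-gene table, rather than by exact string matching. This is a minimal illustration of the general approach, not the paper's implementation; the tables and names (`ONTOLOGY`, `MARKERS`, `score_annotation`) are hypothetical stand-ins for resources like Cell Ontology or CellMarker.

```python
# Hypothetical sketch of knowledge-augmented evaluation for cell type
# annotation. Instead of brittle exact string matching, a prediction is
# credited if it resolves to the reference label through an ontology of
# synonyms, or partially credited if it cites valid marker genes.

# Toy ontology mapping surface forms to canonical cell-type labels.
ONTOLOGY = {
    "cd8+ t cell": "cytotoxic T cell",
    "cytotoxic t lymphocyte": "cytotoxic T cell",
    "nk cell": "natural killer cell",
}

# Toy marker-gene table keyed by canonical cell type.
MARKERS = {
    "cytotoxic T cell": {"CD8A", "CD8B", "GZMB", "PRF1"},
    "natural killer cell": {"NKG7", "GNLY", "KLRD1"},
}

def canonicalize(label: str) -> str:
    """Map a free-text label to its canonical ontology term."""
    key = label.strip().lower()
    return ONTOLOGY.get(key, label.strip())

def score_annotation(prediction: str, reference: str,
                     cited_genes: set[str]) -> float:
    """Return a graded score in [0, 1] with biological grounding.

    1.0 if the prediction resolves to the reference label via the
    ontology; otherwise partial credit proportional to how many of the
    cited genes are known markers of the reference type.
    """
    ref = canonicalize(reference)
    if canonicalize(prediction) == ref:
        return 1.0
    known = MARKERS.get(ref, set())
    if not known:
        return 0.0
    return 0.5 * len(cited_genes & known) / len(known)

# A synonym backed by a valid marker is judged correct here, where an
# exact-match metric would score it 0.
print(score_annotation("CD8+ T cell", "cytotoxic T cell", {"GZMB"}))  # 1.0
```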

πŸ“ Abstract
Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics lacking interpretability and biological grounding. We present SC-ARENA, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a virtual cell abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions. Within this paradigm, we define five natural language tasks (cell type annotation, captioning, generation, perturbation prediction, and scientific QA) that probe core reasoning capabilities in cellular biology. To overcome the limitations of brittle string-matching metrics, we introduce knowledge-augmented evaluation, which incorporates external ontologies, marker databases, and scientific literature to support biologically faithful and interpretable judgments. Experiments and analysis across both general-purpose and domain-specialized LLMs demonstrate that (i) under the Virtual Cell unified evaluation paradigm, current models achieve uneven performance on biologically complex tasks, particularly those demanding mechanistic or causal understanding; and (ii) our knowledge-augmented evaluation framework ensures biological correctness, provides interpretable, evidence-grounded rationales, and achieves high discriminative capacity, overcoming the brittleness and opacity of conventional metrics. SC-Arena thus provides a unified and interpretable framework for assessing LLMs in single-cell biology, pointing toward the development of biology-aligned, generalizable foundation models.
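
The virtual cell abstraction described in the abstract, representing intrinsic attributes alongside gene-level interactions, could be pictured as a record like the one below. This is a speculative sketch of how such a record might unify the five tasks' evaluation targets; the field names are illustrative assumptions, not the paper's actual schema.

```python
# Minimal, speculative sketch of a "virtual cell" record. Field names are
# illustrative assumptions, not taken from the paper.
from dataclasses import dataclass, field


@dataclass
class VirtualCell:
    # Intrinsic attributes (probed by the annotation and captioning tasks).
    cell_type: str
    tissue: str
    # Gene -> expression level (probed by the generation task).
    expression: dict[str, float]
    # Applied perturbation, if any (perturbation prediction task).
    perturbation: str | None = None
    # Gene-gene interaction edges (mechanistic / causal scientific QA).
    interactions: list[tuple[str, str]] = field(default_factory=list)


# Each benchmark task can then be framed as querying or completing a slice
# of this record in natural language, e.g. "annotate the cell type given
# the expression profile" or "predict the expression shift under a given
# perturbation".
cell = VirtualCell(
    cell_type="cytotoxic T cell",
    tissue="peripheral blood",
    expression={"CD8A": 3.1, "GZMB": 2.4, "PRF1": 1.8},
)
print(cell.cell_type)
```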
Problem

Research questions and friction points this paper is trying to address.

single-cell biology
large language models
evaluation benchmark
biological reasoning
interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge-augmented evaluation
virtual cell abstraction
single-cell reasoning
biologically grounded metrics
natural language benchmark