AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

📅 2026-05-11

🤖 AI Summary

In silico phenotypic screen prediction lacks a standard benchmark: existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. To address this gap, this work proposes AssayBench, a benchmark for large language models (LLMs) and AI agents on phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. The task is formulated as per-screen gene ranking, and a continuous metric, the adjusted nDCG, is introduced to compare performance across heterogeneous assays. Zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines, and fine-tuning, ensembling, and prompt optimization improve LLM performance further. All evaluated methods nevertheless remain far below empirically estimated performance ceilings. AssayBench thus provides a practical testbed for measuring progress toward in silico phenotypic screening and virtual cell models.

📝 Abstract

Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet, no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate the screen prediction task as a gene rank prediction for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.
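
To make the ranking formulation concrete, here is a minimal sketch of standard nDCG@k applied to a predicted gene ranking for one screen. It shows only the textbook metric: the adjustment that defines the paper's adjusted nDCG is not described in this text and is not reproduced here. The function names, the gene identifiers, and the relevance scores are all hypothetical toy values chosen for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(predicted_ranking, true_relevance, k):
    """Standard nDCG@k for one screen (not the paper's adjusted variant).

    predicted_ranking: gene identifiers, best-first, as produced by a model.
    true_relevance: mapping from gene identifier to a non-negative score
        derived from the screen (hypothetical values in this sketch).
    """
    # Genes missing from the ground truth contribute zero gain.
    gains = [true_relevance.get(gene, 0.0) for gene in predicted_ranking[:k]]
    ideal = sorted(true_relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0
    return dcg(gains) / ideal_dcg


# Toy example with made-up genes and scores.
toy_relevance = {"GENE_A": 3.0, "GENE_B": 2.0, "GENE_C": 1.0, "GENE_D": 0.0}
perfect = ndcg_at_k(["GENE_A", "GENE_B", "GENE_C", "GENE_D"], toy_relevance, k=3)
reversed_order = ndcg_at_k(["GENE_D", "GENE_C", "GENE_B", "GENE_A"], toy_relevance, k=3)
print(perfect, reversed_order)
```

A ranking that matches the ground-truth order scores 1.0, and worse orderings score lower. Because the raw value depends on how many genes a screen contains and how relevance is distributed, comparing it across heterogeneous assays is not straightforward, which is presumably the motivation for an adjusted metric.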

Problem

Research questions and friction points this paper is trying to address.

virtual cell

phenotypic screening

large language models

benchmark

CRISPR screens

Innovation

Methods, ideas, or system contributions that make the work stand out.

AssayBench

virtual cell

phenotypic screening