Benchmarking AI scientists in omics data-driven biological research

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI evaluation benchmarks predominantly focus on either data-agnostic reasoning or data analysis with predefined answers, and lack rigorous assessment of AI's capability to drive authentic, closed-loop biological discovery. Method: We introduce BaisBench, a benchmark explicitly designed to evaluate AI scientists on real-world biological discovery. It comprises two core tasks: (1) cell type annotation driven by single-cell omics data and (2) scientific question reasoning grounded in recent literature. The evaluation framework pairs expert-curated single-cell datasets with reasoning over external biological knowledge to enable data-driven, knowledge-augmented, and reproducible assessment. Contribution/Results: Comprehensive evaluation reveals that state-of-the-art AI scientists and LLM agents still substantially underperform human experts on both tasks, establishing a quantitative baseline for model improvement. BaisBench is released as an open-source evaluation platform to advance AI-driven biological discovery.

📝 Abstract
The rise of large language models and multi-agent systems has sparked growing interest in AI scientists capable of autonomous biological research. However, existing benchmarks either focus on reasoning without data or on data analysis with predefined statistical answers, lacking realistic, data-driven evaluation settings. Here, we introduce the Biological AI Scientist Benchmark (BaisBench), a benchmark designed to assess AI scientists' ability to generate biological discoveries through data analysis and reasoning with external knowledge. BaisBench comprises two tasks: cell type annotation on 31 expert-labeled single-cell datasets, and scientific discovery through answering 198 multiple-choice questions derived from the biological insights of 41 recent single-cell studies. Systematic experiments on state-of-the-art AI scientists and LLM agents showed that while promising, current models still substantially underperform human experts on both tasks. We hope BaisBench will fill this gap and serve as a foundation for advancing and evaluating AI models for scientific discovery. The benchmark can be found at: https://github.com/EperLuo/BaisBench.
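To make the two tasks concrete, the sketch below shows one plausible way an AI scientist's outputs could be scored: per-cell annotation accuracy against expert labels for task 1, and multiple-choice accuracy for task 2. This is an illustration only; the dataset layout, the `obs["cell_type"]` key, and the `predict_fn`/`answer_fn` callbacks are assumptions, and the benchmark's actual data formats and scoring code are defined in the linked repository.

```python
# Illustrative scoring sketch for BaisBench-style tasks (not the official API).
import anndata as ad
from sklearn.metrics import accuracy_score


def score_cell_type_annotation(h5ad_path: str, predict_fn) -> float:
    """Task 1: compare AI-generated cell type labels with expert annotations.

    Assumes the dataset is an AnnData .h5ad file whose expert labels sit in
    adata.obs["cell_type"] (a placeholder key, not necessarily BaisBench's).
    predict_fn takes the AnnData object and returns one label per cell.
    """
    adata = ad.read_h5ad(h5ad_path)
    predicted = predict_fn(adata)
    return accuracy_score(adata.obs["cell_type"], predicted)


def score_multiple_choice(questions: list[dict], answer_fn) -> float:
    """Task 2: answer multiple-choice questions derived from recent studies.

    Each question dict is assumed to carry "question", "options", and "answer";
    answer_fn returns the chosen option key (e.g. "A").
    """
    correct = sum(
        answer_fn(q["question"], q["options"]) == q["answer"] for q in questions
    )
    return correct / len(questions)
```

Reporting plain accuracy on both tasks mirrors how the paper compares AI scientists against human experts, though the authors may use additional or different metrics.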
Problem

Research questions and friction points this paper is trying to address.

Assessing AI scientists' ability in autonomous biological research
Evaluating data-driven biological discovery with external knowledge
Benchmarking AI performance against human experts in omics research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for AI scientists in biological research
Combines data analysis with external knowledge reasoning
Evaluates models on cell annotation and scientific discovery
Authors

Erpai Luo
Tsinghua University

Jinmeng Jia
Tsinghua University

Yifan Xiong
Microsoft Research

Xiangyu Li
Beijing Jiaotong University

Xiaobo Guo
Dartmouth College
machine learning, deep learning, natural language processing, social media, propagation

Baoqi Yu
Capital Medical University

Lei Wei
Tsinghua University

Xuegong Zhang
Tsinghua University
Bioinformatics, Computational Biology, Pattern Recognition, Machine Learning