Benchmarking AI scientists in omics data-driven biological research

📅 2025-05-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI evaluation benchmarks predominantly focus on either data-agnostic reasoning or data analysis with predefined answers, and lack rigorous assessment of AI's capability to drive authentic, closed-loop biological discovery. Method: We introduce BaisBench, a benchmark explicitly designed to evaluate AI scientists on real-world biological discovery. It comprises two core tasks: (1) cell type annotation driven by single-cell omics data and (2) scientific question reasoning grounded in recent literature. The evaluation framework pairs expert-curated single-cell datasets with reasoning over external biological knowledge to enable data-driven, knowledge-augmented, and reproducible assessment. Contribution/Results: Comprehensive evaluation reveals that state-of-the-art AI scientists and LLM agents still substantially underperform human experts on both tasks, establishing a quantitative baseline for model improvement. BaisBench is released as an open-source evaluation platform to advance AI-driven biological discovery.

📝 Abstract
The rise of large language models and multi-agent systems has sparked growing interest in AI scientists capable of autonomous biological research. However, existing benchmarks either focus on reasoning without data or on data analysis with predefined statistical answers, lacking realistic, data-driven evaluation settings. Here, we introduce the Biological AI Scientist Benchmark (BaisBench), a benchmark designed to assess AI scientists' ability to generate biological discoveries through data analysis and reasoning with external knowledge. BaisBench comprises two tasks: cell type annotation on 31 expert-labeled single-cell datasets, and scientific discovery through answering 198 multiple-choice questions derived from the biological insights of 41 recent single-cell studies. Systematic experiments on state-of-the-art AI scientists and LLM agents showed that while promising, current models still substantially underperform human experts on both tasks. We hope BaisBench will fill this gap and serve as a foundation for advancing and evaluating AI models for scientific discovery. The benchmark can be found at: https://github.com/EperLuo/BaisBench.
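To make the two tasks concrete, the sketch below shows one plausible way an AI scientist's outputs could be scored: per-cell annotation accuracy against expert labels for task 1, and multiple-choice accuracy for task 2. This is an illustration only; the dataset layout, the `obs["cell_type"]` key, and the `predict_fn`/`answer_fn` callbacks are assumptions, and the benchmark's actual data formats and scoring code are defined in the linked repository.

```python
# Illustrative scoring sketch for BaisBench-style tasks (not the official API).
import anndata as ad
from sklearn.metrics import accuracy_score


def score_cell_type_annotation(h5ad_path: str, predict_fn) -> float:
    """Task 1: compare AI-generated cell type labels with expert annotations.

    Assumes the dataset is an AnnData .h5ad file whose expert labels sit in
    adata.obs["cell_type"] (a placeholder key, not necessarily BaisBench's).
    predict_fn takes the AnnData object and returns one label per cell.
    """
    adata = ad.read_h5ad(h5ad_path)
    predicted = predict_fn(adata)
    return accuracy_score(adata.obs["cell_type"], predicted)


def score_multiple_choice(questions: list[dict], answer_fn) -> float:
    """Task 2: answer multiple-choice questions derived from recent studies.

    Each question dict is assumed to carry "question", "options", and "answer";
    answer_fn returns the chosen option key (e.g. "A").
    """
    correct = sum(
        answer_fn(q["question"], q["options"]) == q["answer"] for q in questions
    )
    return correct / len(questions)
```

Reporting plain accuracy on both tasks mirrors how the paper compares AI scientists against human experts, though the authors may use additional or different metrics.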
Problem

Research questions and friction points this paper is trying to address.

Assessing AI scientists' ability in autonomous biological research
Evaluating data-driven biological discovery with external knowledge
Benchmarking AI performance against human experts in omics research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for AI scientists in biological research
Combines data analysis with external knowledge reasoning
Evaluates models on cell annotation and scientific discovery
Authors

Erpai Luo
Tsinghua University

Jinmeng Jia
Tsinghua University

Yifan Xiong
Microsoft Research

Xiangyu Li
Beijing Jiaotong University

Xiaobo Guo
Dartmouth College
machine learning, deep learning, natural language processing, social media, propagation

Baoqi Yu
Capital Medical University

Lei Wei
Tsinghua University

Xuegong Zhang
Tsinghua University
Bioinformatics, Computational Biology, Pattern Recognition, Machine Learning