ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

📅 2024-10-07
🏛️ arXiv.org
📈 Citations: 14
Influential: 1
🤖 AI Summary
Current language agents lack rigorous evaluation, and their ability to generalize to data-driven scientific discovery tasks remains questionable. Method: We introduce ScienceAgentBench, a rigorous benchmark for evaluating language agents on scientific tasks, comprising 102 authentic research tasks from 44 peer-reviewed publications across four disciplines, with each task's target output unified to a self-contained, executable Python program. Evaluation employs an array of metrics examining (i) the generated programs, (ii) their execution results, and (iii) their costs. Tasks are extracted from real papers and validated through multiple rounds by annotators and subject matter experts, and two strategies are employed to mitigate data contamination. Contribution/Results: Experiments with five state-of-the-art LLMs under three agent frameworks show that the best-performing agent solves only 32.4% of the tasks independently; expert-provided knowledge raises this only marginally, to 34.3%. These results indicate that current language agents remain far from enabling end-to-end automated scientific discovery.

📝 Abstract
The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true capabilities. In this work, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands CodeAct, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. In addition, we evaluate OpenAI o1 with direct prompting and self-debug, which demonstrates the effectiveness of increasing inference-time compute. Still, our results underscore the limitations of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.
Problem

Research questions and friction points this paper is trying to address.

Assessing language agents' capabilities in scientific workflows rigorously
Evaluating agents' performance on data-driven scientific discovery tasks
Mitigating data contamination in agent-based scientific automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

A benchmark for rigorously evaluating language agents on scientific tasks
Python programs as unified task outputs
Strategies to mitigate data contamination concerns
Ziru Chen
The Ohio State University
Conversational AI, Natural Language Processing, Machine Learning
Shijie Chen
PhD Student, The Ohio State University
Natural Language Processing, Machine Learning
Yuting Ning
The Ohio State University
Natural Language Processing
Qianheng Zhang
Department of Geography, UW–Madison
Boshi Wang
The Ohio State University
Natural Language Processing, Machine Learning
Botao Yu
PhD student, Ohio State University
AI for Science, NLP, AI Music
Yifei Li
Department of Computer Science and Engineering, OSU
Zeyi Liao
The Ohio State University
AI, NLP, Multimodal, Agent
Chen Wei
Department of Geography, UW–Madison
Zitong Lu
Department of Psychology, OSU
Vishal Dey
FAIR (MSL)
LLM Post-training, AI Agents, Transfer Learning, ML for Drug Discovery, AI4Science
Mingyi Xue
Department of Chemistry, UW–Madison
Frazier N. Baker
Department of Computer Science and Engineering, OSU, Department of Biomedical Informatics, OSU
Benjamin Burns
Department of Computer Science and Engineering, OSU
Daniel Adu-Ampratwum
Research Assistant Professor, Ohio State University
Organic Chemistry, Natural Product Synthesis, Medicinal Chemistry, Drug Discovery
Xuhui Huang
Department of Chemistry, UW–Madison
Xia Ning
Professor, Biomedical Informatics, Computer Science and Engineering, The Ohio State University
GenAI, Medical AI, LLMs, Drug Development
Song Gao
Department of Geography, UW–Madison
Yu Su
Department of Computer Science and Engineering, OSU
Huan Sun
Endowed CoE Innovation Scholar and Associate Professor, The Ohio State University
Agents, Large Language Models, Natural Language Processing, AI