ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

📅 2024-10-07

🏛️ arXiv.org

📈 Citations: 14

✨ Influential: 1

🤖 AI Summary

Current language agents lack rigorous evaluation, and how well they generalize across tasks in data-driven scientific discovery remains in question. Method: We introduce ScienceAgentBench, a benchmark for rigorously evaluating language agents in science, comprising 102 tasks extracted from 44 peer-reviewed publications across four disciplines, with every task's target output unified as a self-contained Python program. Evaluation covers three dimensions: (i) the generated program, including whether it executes, (ii) the correctness of its results, and (iii) cost. Tasks are extracted from real papers and validated over multiple rounds by annotators and nine subject matter experts, and two strategies are proposed to mitigate data contamination. Contribution/Results: Experiments with five open-weight and proprietary LLMs, each run under three agent frameworks (direct prompting, OpenHands CodeAct, and self-debug), show that the best-performing agent solves only 32.4% of the tasks independently given three attempts per task; expert-provided knowledge raises this only marginally, to 34.3%. These results indicate that existing language agents remain well short of end-to-end automated scientific discovery.

📝 Abstract

The advancements of large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true capabilities. In this work, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims on end-to-end automation. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task to a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands CodeAct, and self-debug. Given three attempts for each task, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. In addition, we evaluate OpenAI o1 with direct prompting and self-debug, which demonstrates the effectiveness of increasing inference-time compute. Still, our results underscore the limitations of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.
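To make the headline numbers concrete, here is a minimal Python sketch of how per-task attempts could be rolled up into an execution rate, a success rate, and a mean cost. It is not the benchmark's actual evaluation code: the `Attempt` record, the `summarize` and `make_toy_results` helpers, the rule that a task counts as solved if any of its attempts succeeds, and the toy cost values are all assumptions made for illustration. The only figure taken from the abstract is the task count of 102; the 33 solved tasks in the toy data are chosen because 33/102 rounds to the reported 32.4%.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One attempt at one task (hypothetical record, not the benchmark's schema)."""
    executed: bool   # the generated program ran without error
    correct: bool    # its output met the task's success criterion
    cost_usd: float  # API cost of producing this attempt


def summarize(results: dict[str, list[Attempt]]) -> dict[str, float]:
    """Aggregate attempts per task.

    Assumption: a task counts as executed (or solved) if ANY of its
    attempts executed (or executed and was correct).
    """
    num_tasks = len(results)
    if num_tasks == 0:
        raise ValueError("no tasks to summarize")
    ran = sum(any(a.executed for a in attempts) for attempts in results.values())
    solved = sum(
        any(a.executed and a.correct for a in attempts)
        for attempts in results.values()
    )
    total_cost = sum(a.cost_usd for attempts in results.values() for a in attempts)
    return {
        "execution_rate_pct": 100.0 * ran / num_tasks,
        "success_rate_pct": 100.0 * solved / num_tasks,
        "mean_cost_per_task_usd": total_cost / num_tasks,
    }


def make_toy_results(num_tasks: int = 102, num_solved: int = 33) -> dict[str, list[Attempt]]:
    """Build synthetic data: three attempts per task, the first num_solved tasks solved."""
    results = {}
    for i in range(num_tasks):
        is_solved = i < num_solved
        results[f"task_{i}"] = [
            Attempt(executed=False, correct=False, cost_usd=0.01),
            Attempt(executed=True, correct=False, cost_usd=0.01),
            Attempt(executed=True, correct=is_solved, cost_usd=0.01),
        ]
    return results


summary = summarize(make_toy_results())
print(round(summary["success_rate_pct"], 1))  # 32.4
```

Under the same assumptions, calling `make_toy_results(102, 35)` gives 34.3, matching the reported rate with expert-provided knowledge. The gap between the two settings is therefore about two tasks out of 102, which is why the improvement reads as marginal.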

Problem

Research questions and friction points this paper is trying to address.

Rigorously assessing language agents' capabilities on individual tasks in scientific workflows

Evaluating agents' performance on data-driven scientific discovery tasks

Mitigating data contamination in agent-based scientific automation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for evaluating language agents on authentic scientific tasks

Python programs as unified task outputs

Strategies to mitigate data contamination concerns

🔎 Similar Papers

BLADE: Benchmarking Language Model Agents for Data-Driven Science

2024-08-19 · Conference on Empirical Methods in Natural Language Processing · Citations: 35

ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models

2024-04-11 · arXiv.org · Citations: 17

Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data

2024-02-15 · arXiv.org · Citations: 6
