🤖 AI Summary
Current language agents lack rigorous evaluation, and their ability to generalize across tasks in data-driven scientific discovery is questionable. Method: We introduce SciBench, a benchmark for rigorously evaluating language agents in science, comprising 102 authentic research tasks extracted from 44 peer-reviewed publications across four disciplines, with every task's target output standardized as a self-contained, executable Python program. The evaluation examines three aspects: (i) the generated programs, (ii) their execution results, and (iii) costs. Tasks are extracted from real papers and validated over multiple rounds by annotators and domain experts, and two strategies mitigate data contamination concerns. Contribution/Results: Experiments on five LLMs, each run with three agent frameworks (direct prompting, OpenHands CodeAct, and self-debug), show that the best-performing agent solves only 32.4% of the tasks independently; adding expert-provided knowledge raises this marginally to 34.3%. These results show that existing language agents remain far from enabling end-to-end automated scientific discovery.
📝 Abstract
Advances in large language models (LLMs) have piqued growing interest in developing LLM-based language agents to automate scientific discovery end-to-end, which has sparked both excitement and skepticism about their true capabilities. In this work, we call for rigorous assessment of agents on individual tasks in a scientific workflow before making bold claims about end-to-end automation. To ensure the scientific authenticity and real-world relevance of our benchmark, we extract 102 tasks from 44 peer-reviewed publications in four disciplines and engage nine subject matter experts to validate them. We unify the target output for every task into a self-contained Python program file and employ an array of evaluation metrics to examine the generated programs, execution results, and costs. Each task goes through multiple rounds of manual validation by annotators and subject matter experts to ensure its annotation quality and scientific plausibility. We also propose two effective strategies to mitigate data contamination concerns. Using our benchmark, we evaluate five open-weight and proprietary LLMs, each with three frameworks: direct prompting, OpenHands CodeAct, and self-debug. Given three attempts for each task, the best-performing agent solves only 32.4% of the tasks independently and 34.3% with expert-provided knowledge. In addition, we evaluate OpenAI o1 with direct prompting and self-debug, which demonstrates the effectiveness of increasing inference-time compute. Still, our results underscore the limitations of current language agents in generating code for data-driven discovery, let alone end-to-end automation for scientific research.
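As a rough illustration of the "three attempts for each task" protocol, the sketch below computes a success rate under the assumption that a task counts as solved when at least one of its attempts succeeds. That aggregation rule, the function name `success_rate`, and the toy task ids and outcomes are all assumptions made for illustration; the abstract does not specify how the attempts are combined.

```python
from typing import Dict, List


def success_rate(attempts: Dict[str, List[bool]]) -> float:
    """Return the fraction of tasks with at least one successful attempt.

    Assumption (not stated in the abstract): a task is "solved" if any
    of its attempts succeeded.
    """
    if not attempts:
        raise ValueError("no tasks to score")
    solved = sum(1 for outcomes in attempts.values() if any(outcomes))
    return solved / len(attempts)


# Hypothetical toy data: three attempts per task, True means success.
toy_attempts = {
    "task_a": [False, True, False],
    "task_b": [False, False, False],
    "task_c": [True, True, True],
    "task_d": [False, False, False],
}

rate = success_rate(toy_attempts)
print(f"{rate:.1%}")
```

With 102 tasks, the reported 32.4% and 34.3% are what 33 and 35 solved tasks would round to (33/102 ≈ 0.3235, 35/102 ≈ 0.3431); those counts are an inference from the percentages, not figures given in the abstract.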