🤖 AI Summary
Existing AI agent benchmarks suffer from a lack of holistic coverage, poor reproducibility, weak control of confounding variables, inconsistent interfaces, and missing baselines—limiting their validity for evaluating agents in authentic scientific research settings. To address this, we propose AstaBench: the first standardized benchmark suite for end-to-end assessment of scientific research capabilities, comprising 2,400+ interdisciplinary scientific problems spanning hypothesis generation, literature retrieval, experimental design, and data analysis. Methodologically, it introduces science-oriented evaluation principles, integrates production-grade search tools and modular APIs, rigorously controls confounders (e.g., model cost, tool access), and establishes nine science-optimized classes of Asta agents alongside a multi-tiered hierarchy of baselines. Systematic evaluation of 57 agents across 22 agent classes reveals that while current AI systems achieve moderate performance on isolated tasks, they exhibit substantial deficits in coherent, autonomous, and trustworthy scientific reasoning and execution.
📝 Abstract
AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces for quick agent prototyping and evaluation; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2,400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside it, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.