AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI agent benchmarks are not holistic, are hard to reproduce, control poorly for confounding variables, lack standardized interfaces, and omit baselines, which limits their validity for evaluating agents in authentic scientific research settings. To address this, we propose AstaBench: the first standardized benchmark suite for end-to-end assessment of scientific research capabilities, comprising 2,400+ interdisciplinary scientific problems spanning hypothesis generation, literature retrieval, experimental design, and data analysis. Methodologically, it introduces science-oriented evaluation principles, integrates production-grade search tools and standardized interfaces, controls for confounders (e.g., model cost, tool access), and provides nine science-optimized agent classes alongside numerous baselines. Systematic evaluation of 57 agents across 22 agent classes reveals that while current AI systems show meaningful progress on certain individual aspects, they remain far from solving the challenge of science research assistance.

📝 Abstract
AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces for quick agent prototyping and evaluation; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside this, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking AI agents' scientific research capabilities holistically
Addressing confounding variables in agent evaluation like cost
Providing reproducible tools for controlled agent comparison
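To make the cost confounder concrete, one common way to compare agents on both score and cost is to keep only those that no other agent beats on both axes (a Pareto frontier). The sketch below is an illustration under stated assumptions, not a description of AstaBench's own scoring: the `AgentResult` type, the `pareto_frontier` helper, and all agent names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResult:
    name: str     # hypothetical agent label
    score: float  # benchmark score, higher is better
    cost: float   # cost per problem (e.g. USD), lower is better


def pareto_frontier(results):
    """Return the results that no other result dominates.

    A result is dominated when another one costs no more and scores
    at least as well, with a strict improvement on at least one axis.
    """
    frontier = []
    best_score = float("-inf")
    # Visit cheapest first; among equal costs, highest score first.
    for r in sorted(results, key=lambda r: (r.cost, -r.score)):
        # Keep r only if it beats every cheaper-or-equal result on score.
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


# Made-up numbers, for illustration only.
results = [
    AgentResult("agent-a", score=0.40, cost=0.10),
    AgentResult("agent-b", score=0.55, cost=0.50),
    AgentResult("agent-c", score=0.50, cost=0.80),  # dominated by agent-b
    AgentResult("agent-d", score=0.70, cost=2.00),
]

frontier = pareto_frontier(results)
for r in frontier:
    print(f"{r.name}: score={r.score:.2f}, cost={r.cost:.2f}")
```

Reporting the frontier rather than a single leaderboard column keeps a cheap, moderately accurate agent visible next to an expensive, more accurate one, which is the kind of comparison a raw score ranking hides.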
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scientific research environment with production-grade search tools
Comprehensive suite of nine science-optimized agent classes
Holistic benchmark with 2400+ scientific discovery problems
Authors
Jonathan Bragg
Allen Institute for AI (AI2)
Artificial Intelligence, Human-Computer Interaction, Crowdsourcing
Mike D'Arcy
Allen Institute for AI
Artificial Intelligence, Machine Learning, Natural Language Processing
Nishant Balepur
University of Maryland
Dan Bareket
Allen Institute for Artificial Intelligence
Bhavana Dalvi
Asta Team, Allen Institute for AI
Sergey Feldman
Allen Institute of Artificial Intelligence, Alongside Care
Machine Learning, Estimation, Pattern Recognition
Dany Haddad
Asta Team, Allen Institute for AI
Jena D. Hwang
Allen Institute for AI
natural language processing, computational linguistics, commonsense reasoning, lexical semantics
P. Jansen
Asta Team, Allen Institute for AI, University of Arizona
Varsha Kishore
Cornell University
Machine Learning
Bodhisattwa Prasad Majumder
Researcher, Allen Institute for AI
Natural Language Processing, Interactive Agents, Machine Reasoning, Scientific Discovery
Aakanksha Naik
Allen Institute for Artificial Intelligence
Natural Language Processing, Machine Learning
Sigal Rahamimov
Asta Team, Allen Institute for AI
Kyle Richardson
Asta Team, Allen Institute for AI
Amanpreet Singh
Asta Team, Allen Institute for AI
Harshit Surana
Co-Founder at Chaos Genius
Scientific Discovery, Machine Learning, Convex Optimization
Aryeh Tiktinsky
M.Sc. student, Bar-Ilan University
Natural language processing
Rosni Vasu
University of Zurich
Guy Wiener
Asta Team, Allen Institute for AI
Chloe Anastasiades
Asta Team, Allen Institute for AI
Stefan Candra
Asta Team, Allen Institute for AI
Jason Dunkelberger
Semantic Scholar
Dan Emery
Asta Team, Allen Institute for AI
Rob Evans
Asta Team, Allen Institute for AI
Malachi Hamada
Asta Team, Allen Institute for AI
Regan Huff
Asta Team, Allen Institute for AI
Rodney Kinney
Asta Team, Allen Institute for AI
Matt Latzke
Allen Institute for AI
Accessibility
Jaron Lochner
Asta Team, Allen Institute for AI
Ruben Lozano-Aguilera
Asta Team, Allen Institute for AI
Cecile Nguyen
Asta Team, Allen Institute for AI
Smita Rao
Asta Team, Allen Institute for AI
Amber Tanaka
Asta Team, Allen Institute for AI
Brooke Vlahos
Asta Team, Allen Institute for AI
Peter Clark
Allen Institute for Artificial Intelligence (AI2)
Artificial Intelligence
Doug Downey
Asta Team, Allen Institute for AI
Yoav Goldberg
Professor, Bar Ilan University. Research Director, AI2-Israel
Natural Language Processing, Machine Learning, Syntactic Processing
Ashish Sabharwal
Allen Institute for AI (AI2)
Artificial Intelligence, Constraint Reasoning, Probabilistic Inference
Daniel S. Weld
Asta Team, Allen Institute for AI