AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of systematic benchmarks for evaluating large language model agents across the full scientific research pipeline. To this end, it introduces a comprehensive benchmark designed for autonomous scientific agents, comprising 20 tasks derived from state-of-the-art machine learning papers spanning domains such as language modeling, mathematics, bioinformatics, and time series forecasting. The benchmark supports flexible extension and enables rigorous comparison of agent frameworks, while deliberately withholding baseline code to assess genuine autonomy. Experimental results show that current agents surpass state-of-the-art human performance on four tasks but fall short of human-level capability on the remaining sixteen, with none approaching the theoretical upper bound, which highlights both the benchmark's difficulty and the substantial room for future advancement.

📝 Abstract
LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle -- including idea generation, experiment analysis and iterative refinement -- without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
Problem

Research questions and friction points this paper is trying to address.

AI research agents
scientific benchmark
autonomous scientific research
agentic capabilities
LLM agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI research agents
scientific benchmark
autonomous scientific discovery
LLM agents
research lifecycle evaluation
Alisia Lupidi
University of Cambridge
Bhavul Gauri
FAIR at Meta
Thomas Simon Foster
FAIR at Meta, University of Oxford
Bassel Al Omari
University of Waterloo
Despoina Magka
University of Oxford, Department of Computer Science
Artificial Intelligence, Knowledge Representation and Reasoning, Logic
Alberto Pepe
Sage Bionetworks
Computational Astrophysics, Information Science, Scholarly Communication
Alexis Audran-Reiss
FAIR at Meta
Muna Aghamelu
FAIR at Meta
Nicolas Baldwin
FAIR at Meta
Lucia Cipolina-Kun
FAIR at Meta
Jean-Christophe Gagnon-Audet
FAIR at Meta
Chee Hau Leow
FAIR at Meta
Sandra Lefdal
FAIR at Meta
Hossam Mossalam
FAIR at Meta
Abhinav Moudgil
Mila, Concordia University
Deep Learning
Saba Nazir
FAIR at Meta
Emanuel Tewolde
Carnegie Mellon University
Artificial Intelligence, Algorithmic Game Theory, Reinforcement Learning, Mathematical Optimization
Isabel Urrego
FAIR at Meta
Jordi Armengol Estape
FAIR at Meta
Amar Budhiraja
FAIR at Meta
Gaurav Chaurasia
Meta
AI Agents, Computer Vision, Computer Graphics
Abhishek Charnalia
FAIR at Meta
Derek Dunfield
FAIR at Meta
Karen Hambardzumyan
FAIR at Meta, University College London
Interpretability, Natural Language Processing, Few-Shot Learning
Daniel Izcovich
FAIR at Meta