🤖 AI Summary
Current evaluations of LLM-based agents lack objective benchmarks and feasible quantitative metrics for assessing scientific discovery capabilities. This paper introduces MLRC-Bench, a dynamic benchmark grounded in real-world, frontier machine learning research competitions. It targets open scientific problems without established solutions and defines an objective evaluation protocol built on authentic competition tasks. The authors propose a novel metric, the "Innovation Implementation Score," to quantify the extent to which agents realize novel methodologies, exposing a significant misalignment between LLM-assessed innovation and actual scientific capability. Their evaluation framework integrates multi-round autonomous reasoning, code generation, closed-loop experimental execution, and human expert calibration, and is empirically validated on state-of-the-art agent scaffolds including MLAB. Across seven competition tasks, the best-performing agent closes only 9.3% of the performance gap between the baseline and top human participants. The benchmark is open-sourced and designed for extensibility, establishing a rigorous, task-grounded paradigm for assessing AI's scientific research competence.
📝 Abstract
Existing evaluations of large language model (LLM) agents on scientific discovery lack objective baselines and metrics for assessing the viability of their proposed methods. To address this issue, we introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions. Our benchmark highlights open research problems that demand novel methodologies, in contrast to recent benchmarks such as OpenAI's MLE-Bench (Chan et al., 2024) and METR's RE-Bench (Wijk et al., 2024), which focus on well-established research tasks that are largely solvable through sufficient engineering effort. Unlike prior work such as AI Scientist (Lu et al., 2024b), which evaluates the end-to-end agentic pipeline with an LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a newly proposed rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB (Huang et al., 2024a)) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and agents' actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark designed to grow continually with new ML competitions, encouraging rigorous and objective evaluation of AI's research capabilities.
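The "closes only 9.3% of the gap" figure reads as a baseline-normalized relative improvement: the agent's gain over the baseline divided by the top human participant's gain over the same baseline. A minimal sketch of that normalization, with function and variable names that are illustrative assumptions rather than taken from the paper:

```python
def gap_closed(agent_score: float, baseline_score: float, human_score: float) -> float:
    """Fraction of the baseline-to-human gap closed by the agent.

    Returns 0.0 if the agent matches the baseline, 1.0 if it matches
    the top human participant; values can fall outside [0, 1] if the
    agent underperforms the baseline or exceeds the human score.
    """
    gap = human_score - baseline_score
    if gap == 0:
        raise ValueError("baseline and human scores are identical; gap is undefined")
    return (agent_score - baseline_score) / gap


# Illustrative numbers only (not from the paper):
# baseline 10.0, top human 110.0, agent 19.3 -> 9.3% of the gap closed
print(gap_closed(19.3, 10.0, 110.0))  # 0.093
```

Averaging this fraction across tasks puts heterogeneous competition metrics on a common scale, which is presumably why the headline result is reported this way.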