FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth

๐Ÿ“… 2025-10-12
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing benchmarks for automatic ML research agents emphasize engineering implementation while lacking academic rigor and task diversity, which hinders systematic assessment of agents' fundamental research capabilities. Method: The authors introduce FML-bench, a benchmark built around eight diverse, fundamental machine learning research problems, together with a unified evaluation framework of five complementary metrics that explicitly quantifies, among other factors, the breadth of an agent's exploration. The benchmark reduces coding burden through low-code task interfaces, supports automated experiment execution, and is extensible to real-world GitHub repositories. Contribution/Results: Evaluating state-of-the-art automatic research agents on FML-bench, the authors find that agents employing broad exploration strategies outperform those pursuing narrow but deep refinement, suggesting that breadth of exploration is a stronger driver of research progress than incremental optimization alone.

๐Ÿ“ Abstract
Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at https://github.com/qrzou/FML-bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating autonomous ML research agents' scientific capabilities comprehensively
Addressing limited task diversity in existing ML agent benchmarks
Assessing agent performance on fundamental ML problems rather than applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

FML-bench: a benchmark of eight diverse, fundamental ML research problems for evaluating automatic research agents
Reduces coding burden, emphasizes fundamental problems over specific use cases, and extends to real-world GitHub repositories
Unified evaluation framework with five complementary metrics, including exploration breadth (see the sketch below)
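
The paper page does not spell out the five metrics, but a minimal sketch can illustrate the kind of harness such a framework implies. Everything below (the Task and AgentRun classes, the evaluate function, and the metric names utility/breadth/depth/efficiency/budget_use) is an illustrative assumption, not FML-bench's actual API; see the GitHub repository for the real interface.

```python
# Hypothetical sketch of a benchmark harness in the spirit of FML-bench.
# All names and metric definitions are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    repo_url: str          # GitHub repository the agent works against
    baseline_score: float  # reference performance to improve upon

@dataclass
class AgentRun:
    task: Task
    ideas_tried: list = field(default_factory=list)  # one entry per attempted idea
    best_score: float = 0.0
    steps_used: int = 0

def evaluate(run: AgentRun, budget: int) -> dict:
    """Score one agent run with five illustrative (assumed) metrics."""
    distinct = len(set(run.ideas_tried))
    return {
        "utility": run.best_score - run.task.baseline_score,   # gain over baseline
        "breadth": distinct,                                    # distinct ideas explored
        "depth": len(run.ideas_tried) / max(distinct, 1),       # refinements per idea
        "efficiency": run.best_score / max(run.steps_used, 1),  # score per step
        "budget_use": run.steps_used / budget,                  # fraction of budget spent
    }

if __name__ == "__main__":
    task = Task("generalization", "https://github.com/example/repo", baseline_score=0.71)
    run = AgentRun(task, ideas_tried=["aug", "aug", "reg", "arch"],
                   best_score=0.78, steps_used=40)
    print(evaluate(run, budget=100))
```

Under this reading, a breadth-oriented agent raises the "breadth" count by spending its step budget across many distinct ideas, while a depth-oriented agent raises "depth" by repeatedly refining one idea; the paper's finding is that the former tends to score higher on research outcomes.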
๐Ÿ”Ž Similar Papers
No similar papers found.
Authors

Qiran Zou · National University of Singapore | Tsinghua University · Computer Vision, Machine Learning
Hou Hei Lam · Tsinghua University · AI
Wenhao Zhao · National University of Singapore
Yiming Tang · National University of Singapore
Tingting Chen · National University of Singapore
Samson Yu · National University of Singapore
Tianyi Zhang · University of Minnesota
Chang Liu · Tsinghua University
Xiangyang Ji · Tsinghua University
Dianbo Liu · Assistant Professor, National University of Singapore · "Push the limits of human / machine learning / biomedical sciences"