FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth

๐Ÿ“… 2025-10-12
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing benchmarks for automatic ML research agents emphasize engineering implementation while lacking academic rigor and task diversity, which hinders systematic assessment of agents' fundamental research capabilities. Method: The authors introduce FML-bench, a benchmark built around eight diverse, fundamental machine learning research problems, together with a unified evaluation framework of five complementary metrics that explicitly quantifies, among other factors, the breadth of an agent's exploration. The benchmark reduces coding burden through low-code task interfaces, supports automated experiment execution, and is extensible to real-world GitHub repositories. Contribution/Results: Evaluating state-of-the-art automatic research agents on FML-bench, the authors find that agents employing broad exploration strategies outperform those pursuing narrow but deep refinement, suggesting that breadth of exploration is a stronger driver of research progress than incremental optimization alone.

๐Ÿ“ Abstract
Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at https://github.com/qrzou/FML-bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating autonomous ML research agents' scientific capabilities comprehensively
Addressing limited task diversity in existing ML agent benchmarks
Assessing agent performance on fundamental ML problems rather than applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

FML-bench: a benchmark of eight diverse, fundamental ML research problems for evaluating automatic research agents
Reduces coding burden, emphasizes fundamental problems over specific use cases, and extends to real-world GitHub repositories
Unified evaluation framework with five complementary metrics, including exploration breadth (see the sketch below)
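
The paper page does not spell out the five metrics, but a minimal sketch can illustrate the kind of harness such a framework implies. Everything below (the Task and AgentRun classes, the evaluate function, and the metric names utility/breadth/depth/efficiency/budget_use) is an illustrative assumption, not FML-bench's actual API; see the GitHub repository for the real interface.

```python
# Hypothetical sketch of a benchmark harness in the spirit of FML-bench.
# All names and metric definitions are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    repo_url: str          # GitHub repository the agent works against
    baseline_score: float  # reference performance to improve upon

@dataclass
class AgentRun:
    task: Task
    ideas_tried: list = field(default_factory=list)  # one entry per attempted idea
    best_score: float = 0.0
    steps_used: int = 0

def evaluate(run: AgentRun, budget: int) -> dict:
    """Score one agent run with five illustrative (assumed) metrics."""
    distinct = len(set(run.ideas_tried))
    return {
        "utility": run.best_score - run.task.baseline_score,   # gain over baseline
        "breadth": distinct,                                    # distinct ideas explored
        "depth": len(run.ideas_tried) / max(distinct, 1),       # refinements per idea
        "efficiency": run.best_score / max(run.steps_used, 1),  # score per step
        "budget_use": run.steps_used / budget,                  # fraction of budget spent
    }

if __name__ == "__main__":
    task = Task("generalization", "https://github.com/example/repo", baseline_score=0.71)
    run = AgentRun(task, ideas_tried=["aug", "aug", "reg", "arch"],
                   best_score=0.78, steps_used=40)
    print(evaluate(run, budget=100))
```

Under this reading, a breadth-oriented agent raises the "breadth" count by spending its step budget across many distinct ideas, while a depth-oriented agent raises "depth" by repeatedly refining one idea; the paper's finding is that the former tends to score higher on research outcomes.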
๐Ÿ”Ž Similar Papers
No similar papers found.
Authors

Qiran Zou · National University of Singapore | Tsinghua University · Computer Vision, Machine Learning
Hou Hei Lam · Tsinghua University · AI
Wenhao Zhao · National University of Singapore
Yiming Tang · National University of Singapore
Tingting Chen · National University of Singapore
Samson Yu · National University of Singapore
Tianyi Zhang · University of Minnesota
Chang Liu · Tsinghua University
Xiangyang Ji · Tsinghua University
Dianbo Liu · Assistant Professor, National University of Singapore · "Push the limits of human / machine learning / biomedical sciences"