MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 21
Influential: 2
🤖 AI Summary
Existing AI agents lack systematic evaluation of machine learning engineering (MLE) capabilities. Method: We introduce MLE-bench, a benchmark of 75 real-world Kaggle competitions covering core engineering work such as preparing datasets, training models, and running experiments. We establish human baselines from Kaggle's public leaderboards and release an open-source, reproducible evaluation framework. We evaluate frontier language models (e.g., o1-preview) paired with open-source agent scaffolds (e.g., AIDE). Results: The best-performing setup (o1-preview with AIDE scaffolding) reaches at least Kaggle bronze-medal level in 16.9% of competitions. Further experiments examine how performance scales with agent resources and the impact of contamination from pre-training data, providing a foundation for studying the ML engineering capabilities of AI agents.

📝 Abstract
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.
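The headline number (bronze-medal level in 16.9% of competitions) is simply the fraction of competitions in which the agent's best submission clears the medal threshold derived from the public leaderboard. Below is a minimal sketch of that calculation; the record fields and the medal_rate helper are illustrative assumptions, not the actual mle-bench API.

```python
from dataclasses import dataclass

# Hypothetical record of one agent attempt on one Kaggle competition.
# Field names are illustrative and do not mirror the actual mle-bench schema.
@dataclass
class CompetitionResult:
    competition: str
    agent_score: float             # agent's score on the competition's metric
    bronze_threshold: float        # score needed for bronze (from the public leaderboard)
    higher_is_better: bool = True  # some Kaggle metrics (e.g. RMSE) are lower-is-better

def earned_medal(result: CompetitionResult) -> bool:
    """True if the agent's submission meets at least the bronze-medal bar."""
    if result.higher_is_better:
        return result.agent_score >= result.bronze_threshold
    return result.agent_score <= result.bronze_threshold

def medal_rate(results: list[CompetitionResult]) -> float:
    """Fraction of competitions in which the agent reached bronze or better."""
    if not results:
        return 0.0
    return sum(earned_medal(r) for r in results) / len(results)

# Toy example: 2 of 3 competitions clear the bronze bar -> 66.7%
demo = [
    CompetitionResult("spaceship-titanic", agent_score=0.81, bronze_threshold=0.80),
    CompetitionResult("house-prices", agent_score=0.11, bronze_threshold=0.12, higher_is_better=False),
    CompetitionResult("digit-recognizer", agent_score=0.97, bronze_threshold=0.99),
]
print(f"Medal rate: {medal_rate(demo):.1%}")
```

Since the thresholds come from each competition's publicly available leaderboard, the same calculation can be reported at bronze, silver, or gold level.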
Problem

Research questions and friction points this paper is trying to address.

Evaluate AI agents' ML engineering skills
Assess agent performance against human baselines on Kaggle competitions
Investigate the impact of resource scaling and pre-training contamination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark of 75 Kaggle competitions for ML engineering
Human baselines from public leaderboards
Open-source evaluation code and agent scaffolds
Jun Shern Chan
OpenAI
Neil Chowdhury
Transluce
Oliver Jaffe
OpenAI
James Aung
OpenAI
Dane Sherburn
OpenAI
Evan Mays
OpenAI
Giulio Starace
OpenAI
Kevin Liu
Leon Maksin
Tejal A. Patwardhan
Lilian Weng
Aleksander Mądry