🤖 AI Summary
Existing AI agents lack systematic evaluation of their Machine Learning Engineering (MLE) capabilities. Method: We introduce MLE-bench, the first benchmark dedicated to MLE competence, comprising 75 real-world Kaggle competition tasks that span core engineering stages including data preprocessing, model training, and experiment management. We define and quantify dimensions of MLE capability, establish human baselines from Kaggle's public leaderboards, and release an open-source, reproducible automated evaluation framework. We pair open-source agent scaffolds (e.g., AIDE) with frontier language models (e.g., OpenAI's o1-preview) and conduct a data contamination analysis to safeguard assessment integrity. Results: The best-performing configuration (o1-preview with AIDE) reaches at least Kaggle bronze-medal level on 16.9% of competitions. Our analysis further examines how performance scales with computational resources and assesses the impact of pre-training data contamination, providing foundational insights for agent development in ML engineering.
📝 Abstract
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.
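The headline metric above is the fraction of competitions where the agent places at or above the bronze-medal cutoff on the human leaderboard. A minimal sketch of that aggregation is below; the class and function names are illustrative assumptions, not the actual mle-bench API:

```python
# Hypothetical sketch: aggregate per-competition leaderboard placements into a
# "bronze-or-better" rate, as the benchmark's headline metric describes.
# CompetitionResult and medal_rate are illustrative names, not mle-bench's API.
from dataclasses import dataclass


@dataclass
class CompetitionResult:
    name: str
    agent_rank: int     # agent's position on the competition leaderboard
    bronze_cutoff: int  # worst rank that still earns a bronze medal


def medal_rate(results: list[CompetitionResult]) -> float:
    """Fraction of competitions where the agent places at or above bronze."""
    if not results:
        return 0.0
    medals = sum(1 for r in results if r.agent_rank <= r.bronze_cutoff)
    return medals / len(results)


# Toy data: the agent medals in 2 of 3 competitions.
results = [
    CompetitionResult("competition-a", agent_rank=12, bronze_cutoff=100),
    CompetitionResult("competition-b", agent_rank=450, bronze_cutoff=100),
    CompetitionResult("competition-c", agent_rank=90, bronze_cutoff=120),
]
print(f"bronze-or-better rate: {medal_rate(results):.1%}")  # prints 66.7%
```

In the paper's setting, the cutoffs come from Kaggle's published medal rules applied to each competition's public leaderboard, and the reported 16.9% is this rate computed over the 75 curated competitions.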