🤖 AI Summary
Existing AI agents lack systematic evaluation of their Machine Learning Engineering (MLE) capabilities. Method: We introduce MLE-bench, the first benchmark dedicated to MLE competence, comprising 75 real-world Kaggle competition tasks that span core engineering stages including data preprocessing, model training, and experiment management. We define and quantify dimensions of MLE capability, establish human baselines from Kaggle's public leaderboards, and release an open-source, reproducible automated evaluation framework. We pair open-source agent scaffolds (e.g., AIDE) with frontier language models (e.g., OpenAI's o1-preview) and conduct a data contamination analysis to safeguard assessment integrity. Results: The best-performing configuration (o1-preview with AIDE) reaches at least Kaggle bronze-medal level on 16.9% of competitions. Our analysis further examines how performance scales with computational resources and assesses the impact of pre-training data contamination, providing foundational insights for agent development in ML engineering.
📝 Abstract
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.
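The headline metric above is the fraction of competitions where the agent places at or above the bronze-medal cutoff on the human leaderboard. A minimal sketch of that aggregation is below; the class and function names are illustrative assumptions, not the actual mle-bench API:

```python
# Hypothetical sketch: aggregate per-competition leaderboard placements into a
# "bronze-or-better" rate, as the benchmark's headline metric describes.
# CompetitionResult and medal_rate are illustrative names, not mle-bench's API.
from dataclasses import dataclass


@dataclass
class CompetitionResult:
    name: str
    agent_rank: int     # agent's position on the competition leaderboard
    bronze_cutoff: int  # worst rank that still earns a bronze medal


def medal_rate(results: list[CompetitionResult]) -> float:
    """Fraction of competitions where the agent places at or above bronze."""
    if not results:
        return 0.0
    medals = sum(1 for r in results if r.agent_rank <= r.bronze_cutoff)
    return medals / len(results)


# Toy data: the agent medals in 2 of 3 competitions.
results = [
    CompetitionResult("competition-a", agent_rank=12, bronze_cutoff=100),
    CompetitionResult("competition-b", agent_rank=450, bronze_cutoff=100),
    CompetitionResult("competition-c", agent_rank=90, bronze_cutoff=120),
]
print(f"bronze-or-better rate: {medal_rate(results):.1%}")  # prints 66.7%
```

In the paper's setting, the cutoffs come from Kaggle's published medal rules applied to each competition's public leaderboard, and the reported 16.9% is this rate computed over the 75 curated competitions.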