🤖 AI Summary
Current large language models (LLMs) struggle to autonomously execute multi-step iterative workflows and resolve complex errors in long-horizon machine learning engineering (MLE) tasks. To address this, we propose MLE-Dojo: the first Gym-style interactive benchmark environment built on realistic Kaggle competitions, enabling LLM agents to perform executable, verifiable experiments across the full MLE pipeline, including data preprocessing, modeling, hyperparameter tuning, and debugging. Its key contributions are: (1) a suite of 200+ executable challenges covering error diagnosis, multi-step reasoning, and engineering robustness; (2) integration of a Python sandbox, dynamic feedback, and real-time validation to support both reinforcement learning (RL) and supervised fine-tuning (SFT); and (3) flexible support for diverse model interfaces and customizable toolchains. Evaluation of eight state-of-the-art LLMs shows that agents improve meaningfully through iteration, while revealing fundamental bottlenecks in long-horizon planning and deep error repair. The framework is open-sourced to foster reproducible and scalable MLE agent research.
📝 Abstract
We introduce MLE-Dojo, a Gym-style framework for systematically training (via reinforcement learning), evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.
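To make the Gym-style interaction pattern concrete, here is a minimal, self-contained sketch of the kind of loop the abstract describes: an agent submits code, the environment executes and validates it, and structured feedback (observation, reward, done flag) drives the next iteration. All names here (`ToyMLEEnv`, `Feedback`, the toy scoring rule) are illustrative assumptions for exposition, not MLE-Dojo's actual API.

```python
# Hypothetical sketch of a Gym-style MLE interaction loop.
# Class/method names (ToyMLEEnv, reset, step) follow the Gym convention
# referenced in the abstract; they are NOT MLE-Dojo's real interface.
from dataclasses import dataclass


@dataclass
class Feedback:
    observation: str  # e.g. a traceback or a validation-score report
    reward: float     # e.g. improvement over the best score so far
    done: bool        # episode ends when the step budget is exhausted


class ToyMLEEnv:
    """Minimal stand-in for an interactive MLE environment: the agent
    submits code, the environment "executes" it and returns structured
    feedback usable for both SFT data collection and RL."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.steps = 0
        self.best_score = 0.0

    def reset(self) -> str:
        self.steps = 0
        self.best_score = 0.0
        return "task: maximize validation accuracy on the toy split"

    def step(self, code: str) -> Feedback:
        self.steps += 1
        # Stand-in "sandbox execution": code that compiles counts as a
        # valid run; a toy rule pretends each valid run improves the score.
        try:
            compile(code, "<submission>", "exec")
            score = min(1.0, self.best_score + 0.2)
            obs = f"validation score: {score:.2f}"
        except SyntaxError as exc:
            score = self.best_score
            obs = f"error: {exc.msg}"  # debugging feedback, as in real MLE loops
        reward = score - self.best_score
        self.best_score = max(self.best_score, score)
        return Feedback(obs, reward, self.steps >= self.max_steps)


env = ToyMLEEnv(max_steps=3)
env.reset()
ok = env.step("model = 'baseline'")   # valid submission: positive reward
bad = env.step("def broken(:")        # syntax error: zero reward, error observation
```

The point of the sketch is the protocol, not the scoring: a real environment would replace the `compile` check with sandboxed execution and leaderboard-style metric evaluation, while keeping the same `reset`/`step` feedback contract.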