Synthetic Sandbox for Training Machine Learning Engineering Agents

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of on-policy reinforcement learning for machine learning engineering (MLE) agents, which stems from the need to repeatedly execute full ML pipelines at every rollout step. To overcome this bottleneck, the authors propose SandMLE, a framework that enables large-scale on-policy reinforcement learning in MLE for the first time. SandMLE uses multi-agent collaboration to generate synthetic sandbox environments that are structurally complex yet extremely data-efficient, pairing each task with only 50-200 samples, thereby preserving real-world task complexity while drastically reducing verification overhead. Experiments show that SandMLE cuts training time by over 13x and achieves 20.3%-66.9% relative improvements in medal rate over supervised fine-tuning baselines on MLE-bench-lite. The trained policy also generalizes across unseen agentic scaffolds, attaining up to a 32.4% better HumanRank score on MLE-Dojo.
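To see why micro-scale sandboxes make per-rollout verification cheap, the sketch below builds a toy synthetic task with only 200 samples and runs a complete (if trivial) pipeline over it. All names and the toy centroid model are illustrative assumptions, not the paper's actual implementation; the point is only that with 50-200 samples the full preprocess-train-evaluate loop finishes in milliseconds, so a reward can be computed at every rollout step.

```python
import random
import statistics

def make_micro_task(n_samples=200, seed=0):
    """Generate a tiny two-class 1-D classification task (hypothetical
    stand-in for a SandMLE-style synthetic sandbox dataset)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        # Class 0 centered at 0.0, class 1 centered at 1.0.
        x = rng.gauss(mu=float(label), sigma=0.5)
        data.append((x, label))
    return data

def run_pipeline(data):
    """Toy end-to-end pipeline: split, fit class centroids, score accuracy.
    On a micro-scale dataset this entire call is effectively instant."""
    split = len(data) // 2
    train, test = data[:split], data[split:]
    centroids = {
        c: statistics.mean(x for x, y in train if y == c) for c in (0, 1)
    }
    correct = sum(
        1 for x, y in test
        if min(centroids, key=lambda c: abs(x - centroids[c])) == y
    )
    return correct / len(test)

task = make_micro_task()
reward = run_pipeline(task)  # cheap enough to call at every rollout step
```

A real MLE task would swap in actual preprocessing, model training, and metric code; the micro-scale dataset is what keeps that loop fast enough for trajectory-wise on-policy RL.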
📝 Abstract
As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running full ML pipelines -- data preprocessing, model training, and metric evaluation -- on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by over 13 times, enabling large-scale, on-policy trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes across unseen agentic scaffolds, achieving up to 32.4% better HumanRank score on MLE-Dojo.
Problem

Research questions and friction points this paper is trying to address.

machine learning engineering
agent verification
on-policy reinforcement learning
ML pipeline
computational bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic sandbox
machine learning engineering agents
on-policy reinforcement learning
micro-scale datasets
SandMLE