MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Machine Learning Engineering (MLE) benchmarks suffer from manual construction, poor scalability, and narrow coverage. Method: We propose the first fully automated multi-agent pipeline that dynamically transforms raw datasets into competition-grade MLE tasks via a “generate–verify–execute” paradigm, integrating structured task design, hybrid verification (encompassing syntactic correctness, semantic plausibility, and interactive feasibility), and standardized reconstruction. Contribution/Results: Our pipeline automatically generates 606 diverse, cross-category, cross-objective, and cross-modal tasks from 224 real-world datasets—surpassing manual benchmarks in both scale and diversity. Empirical evaluation demonstrates strong performance alignment (Spearman’s ρ > 0.92) between state-of-the-art large models on auto-generated versus human-crafted tasks, confirming task quality and assessment validity. This work establishes a scalable, reproducible, and verifiable paradigm for MLE benchmark construction.
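The generate–verify–execute paradigm described above can be sketched as a three-stage pipeline, each stage standing in for one agent role. This is a minimal illustrative sketch, not the paper's implementation: the `Task` schema, the `smith` driver, and all function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A candidate MLE task produced from a raw dataset (hypothetical schema)."""
    dataset: str
    description: str
    checks_passed: list = field(default_factory=list)

def generate(dataset: str) -> Task:
    # Stage 1: structured task design — turn a raw dataset into a task draft.
    return Task(dataset=dataset,
                description=f"Predict the target column of {dataset}")

def verify(task: Task) -> Task:
    # Stage 2: hybrid verification — syntactic and semantic checks
    # (the paper also mentions interactive feasibility, handled below).
    for check in ("syntactic", "semantic"):
        task.checks_passed.append(check)
    return task

def execute(task: Task) -> Task:
    # Stage 3: interactive execution — confirm the task is empirically solvable.
    task.checks_passed.append("interactive")
    return task

def smith(datasets: list[str]) -> list[Task]:
    """Run generate → verify → execute over raw datasets."""
    return [execute(verify(generate(d))) for d in datasets]
```

In a real pipeline each stage would be an LLM agent with retries and rejection paths; the linear composition here only shows the ordering of the three phases.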

📝 Abstract
While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data remains significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks that demand extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline that transforms raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm, scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness; it further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith works effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks correlates strongly with their performance on carefully human-designed tasks, highlighting the effectiveness of MLE-Smith in scaling up MLE tasks while maintaining task quality.
Problem

Research questions and friction points this paper is trying to address.

Automating scalable machine learning engineering task generation from raw datasets
Overcoming limitations of static manually curated MLE benchmarks
Creating diverse verifiable MLE challenges with real-world applicability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated multi-agent pipeline for MLE task generation
Generate-verify-execute paradigm ensures quality and diversity
Hybrid verification mechanism enforces structural and semantic rules
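The hybrid verification idea — cheap, strict structural rules first, then a higher-level semantic soundness pass — can be illustrated with a toy checker. Everything here is an assumption for illustration: the task dictionary fields, the metric lists, and the function names are hypothetical, and the semantic rule is a stand-in for what would likely be an LLM-based review in the actual pipeline.

```python
def check_structure(task: dict) -> bool:
    # Strict structural rules: required fields must be present and non-empty.
    required = ("name", "data_path", "objective", "metric", "instructions")
    return all(task.get(k) for k in required)

def check_semantics(task: dict) -> bool:
    # High-level soundness: the metric should suit the stated objective.
    # (A toy rule standing in for a semantic-plausibility review.)
    regression_metrics = {"rmse", "mae"}
    classification_metrics = {"accuracy", "f1", "auc"}
    if task["objective"] == "regression":
        return task["metric"] in regression_metrics
    return task["metric"] in classification_metrics

def hybrid_verify(task: dict) -> bool:
    """Run cheap structural checks first; semantic review only if they pass."""
    return check_structure(task) and check_semantics(task)
```

Ordering the checks this way means malformed tasks are rejected before any expensive semantic (or, in the full system, interactive-execution) stage runs.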