DataMaster: Towards Autonomous Data Engineering for Machine Learning

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the critical yet under-automated role of data engineering in modern machine learning systems, where performance relies heavily on high-quality data. The authors propose DataMaster, a framework that autonomously optimizes the data pipeline (external data discovery, selection and composition, cleaning, and transformation) conditioned on the downstream learning task, so that a fixed learning algorithm yields a stronger solution. Its key components are a DataTree that organizes alternative data-engineering branches, a shared Data Pool of reusable external data sources, and a Global Memory that carries findings across branches; together they target the open-ended search space, branch-dependent refinement, and delayed validation feedback. In the reported experiments, DataMaster improves the medal rate on MLE-Bench Lite by 32.27% over the initial score and reaches 31.02% on GPQA in PostTrainBench, slightly above the instruct model's 30.35%.

📝 Abstract

As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).
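The interplay of the three components described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class names `DataNode` and `DataMasterSketch`, the `use:<name>` step convention, and the `toy_evaluate` scorer are all hypothetical, introduced only to show how a tree of data recipes, a shared pool of sources, and a cumulative memory might fit together while the learning algorithm stays fixed.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional

Recipe = tuple[str, ...]  # ordered data-engineering steps (hypothetical encoding)


@dataclass(eq=False)
class DataNode:
    """One node of a DataTree-like structure: a concrete data recipe."""
    recipe: Recipe
    score: Optional[float] = None          # downstream score, known only after training
    children: list[DataNode] = field(default_factory=list)


@dataclass
class DataMasterSketch:
    """Toy stand-in for the framework; all names here are assumptions."""
    evaluate: Callable[[Recipe], float]    # runs the fixed learner on the recipe's data
    data_pool: dict[str, str] = field(default_factory=dict)          # source name -> location
    memory: list[tuple[Recipe, float]] = field(default_factory=list)  # outcomes of all nodes
    root: DataNode = field(default_factory=lambda: DataNode(recipe=()))

    def register_source(self, name: str, location: str) -> None:
        """Store a discovered external source so every branch can reuse it."""
        self.data_pool[name] = location

    def expand(self, node: DataNode, step: str) -> DataNode:
        """Create a child branch, evaluate it downstream, and record the outcome."""
        if step.startswith("use:") and step[4:] not in self.data_pool:
            raise KeyError(f"source {step[4:]!r} is not in the data pool")
        child = DataNode(recipe=node.recipe + (step,))
        child.score = self.evaluate(child.recipe)   # delayed feedback arrives here
        node.children.append(child)
        self.memory.append((child.recipe, child.score))
        return child

    def best(self) -> tuple[Recipe, float]:
        """Return the best recipe seen so far across all branches."""
        return max(self.memory, key=lambda entry: entry[1])


def toy_evaluate(recipe: Recipe) -> float:
    """Placeholder for 'train the fixed learner and validate'; purely illustrative."""
    return 0.5 + 0.1 * len(set(recipe))
```

In a real system the `evaluate` callable would launch an actual training run and an agent would choose which node to expand and which step to try; here the caller does both by hand, which keeps the sketch focused on how the outcome of one branch becomes visible to the others through the shared memory.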

Problem

Research questions and friction points this paper is trying to address.

autonomous data engineering

data optimization

machine learning

data discovery

data transformation

Innovation

Methods, ideas, or system contributions that make the work stand out.

autonomous data engineering

DataTree

shared Data Pool