DeepPrep: An LLM-Powered Agentic System for Autonomous Data Preparation

📅 2026-02-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a major bottleneck in data science: automatically transforming heterogeneous, noisy tabular data into analysis-ready formats. The authors propose an iterative agent system grounded in execution feedback, featuring a tree-structured reasoning mechanism and a progressive training framework that overcome the limitations of traditional linear interaction paradigms. The approach integrates large language models, environment feedback, and synthetic data generation, enabling non-local revision of earlier decisions based on intermediate execution outcomes. The method achieves accuracy comparable to GPT-5 while reducing inference costs by 15×, establishing state-of-the-art performance among open-source models and demonstrating strong generalization across multiple datasets.

📝 Abstract
Data preparation, which aims to transform heterogeneous and noisy raw tables into analysis-ready data, remains a major bottleneck in data science. Recent approaches leverage large language models (LLMs) to automate data preparation from natural language specifications. However, existing LLM-powered methods either make decisions without grounding in intermediate execution results, or rely on linear interaction processes that offer limited support for revising earlier decisions. To address these limitations, we propose DeepPrep, an LLM-powered agentic system for autonomous data preparation. DeepPrep constructs data preparation pipelines through iterative, execution-grounded interaction with an environment that materializes intermediate table states and returns runtime feedback. To overcome the limitations of linear interaction, DeepPrep organizes pipeline construction with tree-based agentic reasoning, enabling structured exploration and non-local revision based on execution feedback. To enable effective learning of such behaviors, we propose a progressive agentic training framework, together with data synthesis that supplies diverse and complex autonomous data preparation (ADP) tasks. Extensive experiments show that DeepPrep achieves data preparation accuracy comparable to strong closed-source models (e.g., GPT-5) while incurring 15× lower inference cost, establishing state-of-the-art performance among open-source baselines and generalizing effectively across diverse datasets.
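To make the abstract's core idea concrete, here is a minimal sketch (not the authors' implementation; all function names and the scoring heuristic are illustrative assumptions) of tree-structured, execution-grounded pipeline search. Each node holds a partial pipeline and a materialized table state; executing a candidate operation yields feedback, and because the frontier retains all open nodes, the search can resume from any earlier state, the "non-local revision" a linear agent loop cannot do.

```python
import copy

def execute(table, op):
    """Apply one preparation op to a table state; return (new_table, score).
    The score is a toy quality signal: fraction of non-missing cells."""
    new_table = op(copy.deepcopy(table))
    cells = [v for row in new_table for v in row]
    score = sum(v is not None for v in cells) / max(len(cells), 1)
    return new_table, score

def tree_search(table, candidate_ops, max_nodes=50):
    """Best-first search over pipelines. Unlike a linear loop, the frontier
    keeps every open node, so the agent may backtrack to ANY earlier
    pipeline state when execution feedback on a branch is poor."""
    frontier = [([], table, 0.0)]  # (ops_so_far, table_state, score)
    best = frontier[0]
    for _ in range(max_nodes):
        if not frontier:
            break
        frontier.sort(key=lambda n: n[2], reverse=True)  # expand best node
        ops, state, _ = frontier.pop(0)
        for op in candidate_ops:  # stand-ins for LLM-proposed steps
            new_state, score = execute(state, op)
            node = (ops + [op.__name__], new_state, score)
            frontier.append(node)
            if score > best[2]:
                best = node
    return best

# Toy operations standing in for LLM-proposed preparation steps.
def fill_zeros(t):
    return [[0 if v is None else v for v in row] for row in t]

def drop_empty_rows(t):
    return [row for row in t if any(v is not None for v in row)]

raw = [[1, None], [None, None], [3, 4]]
best_ops, best_table, best_score = tree_search(raw, [fill_zeros, drop_empty_rows])
```

In the real system the candidate operations would be proposed by an LLM conditioned on the materialized intermediate table and runtime feedback, and the score would come from execution outcomes rather than a hand-written heuristic; the tree structure itself is what enables revising a decision several steps back.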
Problem

Research questions and friction points this paper is trying to address.

data preparation
large language models
autonomous data processing
execution feedback
pipeline revision
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic reasoning
execution-grounded interaction
tree-based pipeline construction
progressive agentic training
autonomous data preparation