🤖 AI Summary
Data cleaning remains largely manual, inefficient, and error-prone. This paper proposes the first goal-driven, LLM-based framework for automatic workflow generation: given a dirty table and a target query, it generates end to end a minimal viable clean table together with the executable cleaning steps used to produce it, including deduplication, missing-value imputation, and format standardization. The contributions are threefold: (1) the first benchmark dataset of annotated tuples of (purpose, dirty table, clean table, cleaning workflow, answer set); (2) a zero-shot, multi-stage prompting framework, requiring no fine-tuning, that decomposes the task into goal-column identification, data-quality diagnosis, and operation-parameter generation; (3) empirical validation across three major LLM families that off-the-shelf LLMs possess reasoning capabilities sufficient to generate high-quality, executable cleaning workflows, substantially reducing human intervention.
📝 Abstract
We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows. To evaluate LLMs' ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data-cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), the pipeline generates a minimal clean table sufficient to address the purpose, along with the data-cleaning workflow used to produce it. The planning process involves three main LLM-driven components: (1) Select Target Columns: identifies the set of columns related to the purpose. (2) Inspect Column Quality: assesses the data quality of each target column and generates a Data Quality Report that serves as the operation objectives. (3) Generate Operation&Arguments: predicts the next operation and its arguments based on the Data Quality Report. Additionally, we propose a data-cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data-cleaning purposes of varying difficulty levels. The benchmark comprises annotated datasets, each consisting of a purpose, raw table, clean table, data-cleaning workflow, and answer set. In our experiments, we evaluate three LLMs on auto-generating purpose-driven data-cleaning workflows. The results indicate that LLMs plan and generate data-cleaning workflows well without the need for fine-tuning.
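The three-stage planning loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and simple heuristics stand in for the LLM prompts that would drive each stage in the real pipeline.

```python
# Sketch of the AutoDCWorkflow three-stage loop. Each stage below is a stub
# heuristic standing in for an LLM call; names and logic are illustrative.

def select_target_columns(table, purpose):
    # Stage 1: identify columns related to the purpose
    # (stub: keep columns whose name appears in the purpose text).
    return [col for col in table if col.lower() in purpose.lower()]

def inspect_column_quality(table, columns):
    # Stage 2: build a Data Quality Report per target column
    # (stub checks for missing values and duplicates).
    report = {}
    for col in columns:
        values = table[col]
        non_null = [v for v in values if v is not None]
        issues = []
        if len(non_null) < len(values):
            issues.append("missing_values")
        if len(non_null) != len(set(non_null)):
            issues.append("duplicates")
        report[col] = issues
    return report

def generate_operation_arguments(report):
    # Stage 3: map each reported issue to an operation with arguments.
    operations = []
    for col, issues in report.items():
        if "duplicates" in issues:
            operations.append({"op": "remove_duplicates", "column": col})
        if "missing_values" in issues:
            operations.append({"op": "fill_missing", "column": col,
                               "strategy": "mode"})
    return operations

# Toy dirty table: "city" has a duplicate and a missing value.
table = {"city": ["NYC", "NYC", None, "LA"], "id": [1, 2, 3, 4]}
purpose = "How many distinct city values are there?"

columns = select_target_columns(table, purpose)
report = inspect_column_quality(table, columns)
workflow = generate_operation_arguments(report)
print(workflow)
# → [{'op': 'remove_duplicates', 'column': 'city'},
#    {'op': 'fill_missing', 'column': 'city', 'strategy': 'mode'}]
```

The staged decomposition is the key design choice: rather than asking a model for a whole workflow in one shot, each prompt has a narrow objective (column selection, quality diagnosis, operation prediction), which keeps the generated steps grounded in an explicit Data Quality Report.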