🤖 AI Summary
Data cleaning remains largely manual, inefficient, and error-prone. This paper proposes the first goal-driven, LLM-based framework for automatic workflow generation: given a dirty table and a target query, it generates end to end a minimal viable clean table together with the executable cleaning steps used to produce it, including deduplication, missing-value imputation, and format standardization. The contributions are threefold: (1) the first benchmark dataset of annotated tuples of (purpose, dirty table, clean table, cleaning workflow, answer set); (2) a zero-shot, multi-stage prompting framework, requiring no fine-tuning, that decomposes the task into goal-column identification, data-quality diagnosis, and operation-parameter generation; (3) empirical validation across three major LLM families that off-the-shelf LLMs possess reasoning capabilities sufficient to generate high-quality, executable cleaning workflows, substantially reducing human intervention.
📝 Abstract
We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows. To evaluate LLMs' ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs on data-cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), the pipeline generates a minimal clean table sufficient to address the purpose, along with the data-cleaning workflow used to produce it. The planning process involves three main LLM-driven components: (1) Select Target Columns: identifies the set of columns related to the purpose. (2) Inspect Column Quality: assesses the data quality of each target column and generates a Data Quality Report that serves as the operation objectives. (3) Generate Operation&Arguments: predicts the next operation and its arguments based on the Data Quality Report. Additionally, we propose a data-cleaning benchmark to evaluate the capability of LLM agents to automatically generate workflows that address data-cleaning purposes of varying difficulty levels. The benchmark comprises annotated datasets, each consisting of a purpose, raw table, clean table, data-cleaning workflow, and answer set. In our experiments, we evaluate three LLMs on auto-generating purpose-driven data-cleaning workflows. The results indicate that LLMs plan and generate data-cleaning workflows well without the need for fine-tuning.
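The three-stage planning loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and simple heuristics stand in for the LLM prompts that would drive each stage in the real pipeline.

```python
# Sketch of the AutoDCWorkflow three-stage loop. Each stage below is a stub
# heuristic standing in for an LLM call; names and logic are illustrative.

def select_target_columns(table, purpose):
    # Stage 1: identify columns related to the purpose
    # (stub: keep columns whose name appears in the purpose text).
    return [col for col in table if col.lower() in purpose.lower()]

def inspect_column_quality(table, columns):
    # Stage 2: build a Data Quality Report per target column
    # (stub checks for missing values and duplicates).
    report = {}
    for col in columns:
        values = table[col]
        non_null = [v for v in values if v is not None]
        issues = []
        if len(non_null) < len(values):
            issues.append("missing_values")
        if len(non_null) != len(set(non_null)):
            issues.append("duplicates")
        report[col] = issues
    return report

def generate_operation_arguments(report):
    # Stage 3: map each reported issue to an operation with arguments.
    operations = []
    for col, issues in report.items():
        if "duplicates" in issues:
            operations.append({"op": "remove_duplicates", "column": col})
        if "missing_values" in issues:
            operations.append({"op": "fill_missing", "column": col,
                               "strategy": "mode"})
    return operations

# Toy dirty table: "city" has a duplicate and a missing value.
table = {"city": ["NYC", "NYC", None, "LA"], "id": [1, 2, 3, 4]}
purpose = "How many distinct city values are there?"

columns = select_target_columns(table, purpose)
report = inspect_column_quality(table, columns)
workflow = generate_operation_arguments(report)
print(workflow)
# → [{'op': 'remove_duplicates', 'column': 'city'},
#    {'op': 'fill_missing', 'column': 'city', 'strategy': 'mode'}]
```

The staged decomposition is the key design choice: rather than asking a model for a whole workflow in one shot, each prompt has a narrow objective (column selection, quality diagnosis, operation prediction), which keeps the generated steps grounded in an explicit Data Quality Report.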