Towards Practical Benchmarking of Data Cleaning Techniques: On Generating Authentic Errors via Large Language Models

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Data quality assessment is hindered by the scarcity of realistic, diverse erroneous datasets and the high cost and inconsistency of manual annotation. To address this, we propose TableEG, the first LLM-driven framework for generating realistic tabular errors across multiple domains. TableEG models error generation as a triplet (input table I, transformation operation T, output table O) and employs table-structure-aware fine-tuning to faithfully capture complex intra-table dependencies and realistic error distributions. Experiments on 12 real-world datasets demonstrate that TableEG-generated errors closely approximate ground-truth errors in type, frequency, and spatial distribution. Moreover, its evaluation of error detection algorithms—especially ML-based ones—exhibits strong agreement with human annotations (average Spearman’s ρ > 0.92). TableEG thus establishes a scalable, reproducible, and systematic benchmark for evaluating data cleaning techniques.

📝 Abstract
Data quality remains an important challenge in data-driven systems, as errors in tabular data can severely compromise downstream analytics and machine learning performance. Although numerous error detection algorithms have been proposed, the lack of diverse, real-world error datasets limits comprehensive evaluation. Manual error annotation is both time-consuming and inconsistent, motivating the exploration of synthetic error generation as an alternative. In this work, we introduce TableEG, a framework that leverages large language models (LLMs) to generate authentic errors. By employing a table fine-tuning strategy and a triplet representation $(I, T, O)$ to model error generation, detection, and correction tasks, TableEG captures the complex dependencies inherent in two-dimensional tables. Trained on 12 real-world datasets spanning 10 diverse domains, TableEG ensures that the synthesized errors faithfully reflect authentic error distributions. Experimental results indicate that errors generated by TableEG exhibit superior pattern and distribution similarity compared to both rule-based methods and LLM-generated errors without fine-tuning. Furthermore, performance metrics on TableEG-generated errors closely align with those on real-world errors across nearly all datasets and detection algorithms, particularly for machine-learning-based detection techniques. Overall, TableEG not only bridges the gap between synthetic and real-world errors but also establishes a robust benchmark for subsequent error detection and correction tasks.
Problem

Research questions and friction points this paper is trying to address.

Generating authentic errors in tabular data using LLMs
Addressing lack of diverse real-world error datasets
Improving evaluation of error detection and correction techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLMs for authentic error generation
Uses triplet representation for error tasks
Fine-tuned on diverse real-world datasets
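To make the triplet idea concrete, here is a minimal sketch of how an $(I, T, O)$ error-generation triplet might be represented in code. All names and the toy typo-injection rule are illustrative assumptions for exposition; they are not TableEG's actual API, and in the paper the transformation is produced by a fine-tuned LLM rather than a hand-written rule.

```python
# Hypothetical sketch of the (I, T, O) triplet: input table I,
# transformation operation T, output table O with an injected error.
from dataclasses import dataclass

Table = list[dict]  # a table modeled as a list of row dicts


@dataclass
class ErrorTriplet:
    input_table: Table    # I: clean input table
    transformation: str   # T: description of the error operation
    output_table: Table   # O: table containing the injected error


def inject_typo(table: Table, row: int, column: str) -> ErrorTriplet:
    """Toy rule-based stand-in for an LLM-generated transformation:
    duplicates the last character of one cell to simulate a typo."""
    corrupted = [dict(r) for r in table]  # copy rows so I stays clean
    value = str(corrupted[row][column])
    corrupted[row][column] = value + value[-1]
    return ErrorTriplet(
        input_table=table,
        transformation=f"introduce a typo in row {row}, column '{column}'",
        output_table=corrupted,
    )


clean = [{"city": "Berlin", "zip": "10115"},
         {"city": "Paris", "zip": "75001"}]
triplet = inject_typo(clean, row=0, column="city")
```

Pairing each corrupted table with its clean counterpart and the transformation description is what lets the same triplet serve error generation, detection (predict which cells changed), and correction (recover I from O) tasks.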