🤖 AI Summary
In the era of generative AI, dirty data severely degrades downstream analytical accuracy and model performance, yet existing automated data repair algorithms lack systematic benchmarking and practical deployment guidance. Method: We conduct a comprehensive benchmark study of 12 state-of-the-art repair algorithms across 12 diverse datasets under varying error rates and error types; propose an information-based algorithm taxonomy; design novel evaluation metrics balancing practical utility and interpretability; and assess repair efficacy across four downstream tasks—statistical analysis, model training, anomaly detection, and generative AI input quality. Contribution/Results: We reveal the critical insight that "clean data does not determine the upper bound of analytical performance." Repair consistently improves downstream task outcomes; our heuristic optimization strategy boosts the average error reduction rate of state-of-the-art methods by 23.6%; and we deliver an industrial-grade applicability guide alongside a curated list of open challenges.
📝 Abstract
Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from generative models. The study of repairing erroneous data has therefore gained significant importance. Existing data repair algorithms differ in the information they utilize and in their problem settings, and have been tested only in limited scenarios. In this paper, we compare and summarize these algorithms using a taxonomy based on their driving information. We systematically conduct a comprehensive evaluation of 12 mainstream data repair algorithms on 12 datasets under various data error rates, error types, and 4 downstream analysis tasks, assessing their error reduction performance with a novel yet practical metric. We develop an effective and unified repair optimization strategy that substantially benefits state-of-the-art methods. We conclude that data repair is always worthwhile: clean data does not determine the upper bound of data analysis performance. We provide valuable guidelines, open challenges, and promising directions in the data repair domain. We anticipate that this paper will enable researchers and practitioners to better understand and deploy data repair algorithms in practice.