🤖 AI Summary
In the era of generative AI, dirty data severely degrades downstream analytical accuracy and model performance, yet existing automated data repair algorithms lack systematic benchmarking and practical deployment guidance. Method: We conduct a comprehensive benchmark study of 12 state-of-the-art repair algorithms across 12 diverse datasets under varying error rates and error types; propose an information-based algorithm taxonomy; design novel evaluation metrics balancing practical utility and interpretability; and assess repair efficacy across four downstream tasks—statistical analysis, model training, anomaly detection, and generative AI input quality. Contribution/Results: We reveal the critical insight that "clean data does not determine the upper bound of analytical performance." Repair consistently improves downstream task outcomes; our heuristic optimization strategy boosts the average error reduction rate of state-of-the-art methods by 23.6%; and we deliver an industrial-grade applicability guide alongside a curated list of open challenges.
📝 Abstract
Data quality is paramount in today's data-driven world, especially in the era of generative AI. Dirty data with errors and inconsistencies usually leads to flawed insights, unreliable decision-making, and biased or low-quality outputs from generative models. The study of repairing erroneous data has therefore gained significant importance. Existing data repair algorithms differ in the information they utilize and in their problem settings, and have been tested only in limited scenarios. In this paper, we compare and summarize these algorithms using a taxonomy based on their driving information. We systematically conduct a comprehensive evaluation of 12 mainstream data repair algorithms on 12 datasets under various data error rates, error types, and 4 downstream analysis tasks, assessing their error reduction performance with a novel yet practical metric. We develop an effective and unified repair optimization strategy that substantially benefits state-of-the-art methods. We conclude that data repair is always worthwhile: clean data does not determine the upper bound of data analysis performance. We provide valuable guidelines, open challenges, and promising directions in the data repair domain. We anticipate that this paper will enable researchers and practitioners to better understand and deploy data repair algorithms in practice.