🤖 AI Summary
A core challenge in data quality monitoring is error provenance: identifying the mechanisms that generate errors, a problem that existing work rarely models explicitly. This paper focuses on errors that depend on characteristics of the data itself and proposes MechDetect, a method that extends missing-data mechanism detection to other error types, including outliers, inconsistencies, and format violations. Using supervised learning, MechDetect jointly models tabular data and their error masks to determine automatically whether observed errors depend on the underlying data. Experiments on established benchmark datasets demonstrate that MechDetect reliably identifies error-generation mechanisms, providing a practical basis for explainable data repair.
📝 Abstract
Data quality monitoring is a core challenge in modern information processing systems. While many approaches to detect data errors or shifts have been proposed, few studies investigate the mechanisms governing error generation. We argue that knowing how errors were generated can be key to tracing and fixing them. In this study, we build on existing work in the statistics literature on missing values and propose MechDetect, a simple algorithm to investigate error-generation mechanisms. Given a tabular dataset and a corresponding error mask, the algorithm uses machine learning models to estimate whether the errors depend on the data. Our work extends established approaches for detecting the mechanisms underlying missing values and can be readily applied to other error types, provided that an error mask is available. We demonstrate the effectiveness of MechDetect in experiments on established benchmark datasets.
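The core idea, testing whether an error mask is predictable from the data, can be sketched as follows. This is a minimal illustration in the spirit of the abstract, not the authors' implementation: the function name, the choice of a random-forest classifier, and the accuracy margin are all assumptions. For each column, a classifier tries to predict that column's error mask from the remaining columns; if it beats the majority-class baseline by a clear margin, the errors plausibly depend on the data (analogous to MAR rather than MCAR in the missing-data literature).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def errors_depend_on_data(X, error_mask, margin=0.05):
    """For each column j, return True if the error mask of column j is
    predictable from the other columns better than chance (illustrative)."""
    results = {}
    for j in range(X.shape[1]):
        y = error_mask[:, j].astype(int)
        if y.sum() == 0 or y.sum() == len(y):
            results[j] = False  # mask has no variation, nothing to test
            continue
        features = np.delete(X, j, axis=1)  # predict mask from other columns
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        acc = cross_val_score(clf, features, y, cv=5).mean()
        baseline = max(y.mean(), 1 - y.mean())  # majority-class accuracy
        results[j] = acc > baseline + margin
    return results

# Synthetic demo: errors in column 1 occur exactly when column 0 is large,
# so the mask of column 1 should be detected as data-dependent.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
mask = np.zeros(X.shape, dtype=bool)
mask[:, 1] = X[:, 0] > 1.0
result = errors_depend_on_data(X, mask)
```

In this toy setup the test flags column 1 as data-dependent, while columns 0 and 2 (which contain no errors) are not flagged. A real implementation would replace the fixed accuracy margin with a proper statistical test.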