MechDetect: Detecting Data-Dependent Errors

📅 2025-12-03
🤖 AI Summary
A core challenge in data quality monitoring is error provenance: identifying the mechanisms that generate errors. Existing work seldom models these mechanisms explicitly. This paper focuses on errors that depend on the data itself and proposes MechDetect, which extends mechanism detection for missing values to other error types, such as outliers, inconsistencies, and format violations, provided that an error mask is available. Given a tabular dataset and its error mask, MechDetect uses machine learning models to estimate whether the observed errors depend on the data. Experiments on established benchmark datasets demonstrate its effectiveness in diagnosing error-generation mechanisms. Knowing how errors were generated can in turn help trace and fix them.

📝 Abstract
Data quality monitoring is a core challenge in modern information processing systems. While many approaches to detect data errors or shifts have been proposed, few studies investigate the mechanisms governing error generation. We argue that knowing how errors were generated can be key to tracing and fixing them. In this study, we build on existing work in the statistics literature on missing values and propose MechDetect, a simple algorithm to investigate error generation mechanisms. Given a tabular data set and a corresponding error mask, the algorithm estimates whether or not the errors depend on the data using machine learning models. Our work extends established approaches to detect mechanisms underlying missing values and can be readily applied to other error types, provided that an error mask is available. We demonstrate the effectiveness of MechDetect in experiments on established benchmark datasets.
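One way to read the procedure described in the abstract is as a prediction test: if a model can predict the error mask of a column from the data better than chance, the errors plausibly depend on the data. The following is a minimal sketch under stated assumptions, not the authors' implementation. The choice of logistic regression, hold-out AUC, a permutation test, and every function name here are illustrative.

```python
import math
import random


def auc(scores, labels):
    """Rank-based AUC: chance that a random error row outscores a random clean row."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5  # AUC is undefined with one class; treat as chance level
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_logreg(X, y, epochs=100, lr=0.1):
    """Gradient-descent logistic regression; returns weights with the bias last."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, row)) + w[-1]
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(row):
                grad[j] += (p - target) * xj
            grad[-1] += p - target
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def holdout_auc(X, mask, rng):
    """Train on a random half of the rows, score the other half."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    cut = len(idx) // 2
    train, test = idx[:cut], idx[cut:]
    w = fit_logreg([X[i] for i in train], [mask[i] for i in train])
    scores = [sum(wi * xi for wi, xi in zip(w, X[i])) + w[-1] for i in test]
    return auc(scores, [mask[i] for i in test])


def dependence_p_value(X, mask, n_perm=30, seed=0):
    """Compare the observed AUC with AUCs obtained after shuffling the mask."""
    rng = random.Random(seed)
    observed = holdout_auc(X, mask, rng)
    null = []
    for _ in range(n_perm):
        shuffled = mask[:]
        rng.shuffle(shuffled)  # breaks any link between data and errors
        null.append(holdout_auc(X, shuffled, rng))
    p = (1 + sum(a >= observed for a in null)) / (1 + n_perm)
    return observed, p


# Hypothetical demo data: an error occurs whenever feature 0 exceeds 0.5.
demo_rng = random.Random(42)
X = [[demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)] for _ in range(200)]
dependent_mask = [int(row[0] > 0.5) for row in X]
auc_dep, p_dep = dependence_p_value(X, dependent_mask)
print(f"AUC={auc_dep:.2f}, p={p_dep:.3f}")
```

A small p-value suggests that the errors depend on the data. A large one is consistent with errors that occur independently of it, although it does not prove independence.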
Problem

Research questions and friction points this paper is trying to address.

Detect data-dependent error generation mechanisms
Extend missing value analysis to other error types
Estimate error dependency using machine learning models
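
For orientation, the missing-value literature that the paper builds on distinguishes three mechanisms (missing completely at random, missing at random, missing not at random). A simplified per-column reading of those definitions, carried over to an error mask, could look as follows. The notation is an assumption made for illustration and is not taken from the paper.

```latex
% E_j: binary error indicator for column j (1 = erroneous cell)
% X_j: clean value in column j; X_{-j}: values in all other columns
\begin{align*}
\text{completely at random:} \quad & P(E_j = 1 \mid X_j, X_{-j}) = P(E_j = 1) \\
\text{at random:} \quad & P(E_j = 1 \mid X_j, X_{-j}) = P(E_j = 1 \mid X_{-j}) \\
\text{not at random:} \quad & P(E_j = 1 \mid X_j, X_{-j}) \text{ depends on } X_j
\end{align*}
```
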
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses machine learning to detect data-dependent error mechanisms
Extends missing value analysis to general error types
Requires only tabular data and error mask for operation
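
The error mask is the only input besides the table itself. The sketch below shows one way such a mask could be built when a clean reference copy of the table happens to be available. That availability is an assumption, as are the function name and the list-of-dicts layout. In practice the mask might instead come from an error detection tool or manual labels.

```python
def error_mask(clean_rows, dirty_rows):
    """Return one dict per row with 1 where the dirty cell differs from the clean cell."""
    if len(clean_rows) != len(dirty_rows):
        raise ValueError("clean and dirty tables must have the same number of rows")
    return [
        {col: int(dirty.get(col) != clean[col]) for col in clean}
        for clean, dirty in zip(clean_rows, dirty_rows)
    ]


# Hypothetical two-row table with one format violation and one missing value.
clean = [{"age": 30, "city": "Berlin"}, {"age": 41, "city": "Bonn"}]
dirty = [{"age": 30, "city": "berlin"}, {"age": None, "city": "Bonn"}]
mask = error_mask(clean, dirty)
print(mask)
```

Because the mask only records where errors occur, not what kind they are, the same representation covers missing values, outliers, and format violations alike.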
Philipp Jung
Berlin University of Applied Sciences and Technology, Berlin, Germany
Nicholas Chandler
Berlin University of Applied Sciences and Technology, Berlin, Germany
Sebastian Jäger
Berlin University of Applied Sciences and Technology, Berlin, Germany
Felix Biessmann
Einstein Center Digital Future, Berlin University of Applied Sciences
Machine Learning