Tab-Shapley: Identifying Top-k Tabular Data Quality Insights

📅 2025-01-12
🤖 AI Summary
Problem: Data quality issues in complex relational tables often stem from hidden root causes, and conventional frequency-based approaches neglect the dependencies among attributes. Method: This paper proposes the first unsupervised top-k data quality insight discovery method grounded in Shapley values. It models attribute combinations as players in a cooperative game and derives an analytical closed-form solution for the Shapley values of that game. Combined with subset-space pruning and an efficient solving algorithm, the method identifies anomalous attribute subsets and the record subsets that exhibit them. Contribution/Results: Evaluated on diverse real-world multi-source tabular datasets, the method significantly outperforms baseline approaches. It is interpretable, since data quality issues can be attributed to specific attribute interactions, and operationally useful, since it supports remediation at the root-cause level.

📝 Abstract
We present an unsupervised method for aggregating anomalies in tabular datasets by identifying the top-k tabular data quality insights. Each insight consists of a set of anomalous attributes and the corresponding subsets of records that serve as evidence to the user. The process of identifying these insight blocks is challenging due to (i) the absence of labeled anomalies, (ii) the exponential size of the subset search space, and (iii) the complex dependencies among attributes, which obscure the true sources of anomalies. Simple frequency-based methods fail to capture these dependencies, leading to inaccurate results. To address this, we introduce Tab-Shapley, a cooperative game theory based framework that uses Shapley values to quantify the contribution of each attribute to the data's anomalous nature. While calculating Shapley values typically requires exponential time, we show that our game admits a closed-form solution, making the computation efficient. We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels.
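To make the claim about closed-form Shapley values concrete, here is a minimal sketch on a toy game. It is not the game defined in the paper: the `flags` input (a per-record set of suspicious attributes), the attribute names, and the coverage-style characteristic function `value` are all assumptions chosen for illustration. For this particular toy game the Shapley value of an attribute is known to equal the sum, over records where it is flagged, of one over the number of flagged attributes in that record, so the exponential enumeration and the linear-time formula can be compared directly.

```python
from itertools import combinations
from math import factorial

# Hypothetical input: for each record, the set of attributes some upstream
# unsupervised check marked as suspicious. Not taken from the paper.
flags = [
    {"age", "zip"},
    {"age"},
    {"zip", "income"},
    set(),
    {"age", "zip", "income"},
]
attributes = ["age", "zip", "income"]


def value(coalition):
    """Toy characteristic function (an assumption): the number of records
    with at least one flagged attribute inside the coalition."""
    return sum(1 for row in flags if row & coalition)


def shapley_bruteforce(players):
    """Textbook Shapley value: enumerates every coalition, so the cost
    grows exponentially with the number of attributes."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def shapley_closed_form(players):
    """Closed form for this toy coverage game only: each record splits one
    unit of credit equally among its flagged attributes."""
    phi = {p: 0.0 for p in players}
    for row in flags:
        for p in row:
            phi[p] += 1.0 / len(row)
    return phi


def top_k(phi, k):
    """Rank attributes by Shapley value, highest first."""
    return sorted(phi, key=phi.get, reverse=True)[:k]


brute = shapley_bruteforce(attributes)
closed = shapley_closed_form(attributes)
ranking = top_k(closed, 2)
```

Both functions return the same scores here, but only the brute-force version visits every coalition. The paper's own game and closed form may differ from this sketch; the point is only that a suitably structured game lets the exponential sum collapse into a single pass over the records.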
Problem

Research questions and friction points this paper is trying to address.

Data Quality
Automated Detection
Complex Data Relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tab-Shapley
Shapley Values
Efficient Computation