Tab-Shapley: Identifying Top-k Tabular Data Quality Insights

📅 2025-01-12

📈 Citations: 0

✨ Influential: 0

career value

165K/year

🤖 AI Summary

Data quality issues in complex relational tables often stem from hidden root causes, while conventional frequency-based approaches neglect attribute dependencies. Method: This paper proposes the first unsupervised Top-k data quality insight discovery method grounded in Shapley values. It models attribute combinations as players in a cooperative game and derives, for the first time, an analytical closed-form solution for Shapley values. Integrated with subset-space pruning and an efficient solving algorithm, the method precisely identifies anomalous attribute subsets and their corresponding anomalous record subsets. Contribution/Results: Evaluated on diverse real-world multi-source tabular datasets, the method significantly outperforms baseline approaches. It offers strong interpretability—enabling attribution of data quality issues to specific attribute interactions—and operational utility by supporting root-cause-level data quality remediation.

Technology Category

Application Category

📝 Abstract

We present an unsupervised method for aggregating anomalies in tabular datasets by identifying the top-k tabular data quality insights. Each insight consists of a set of anomalous attributes and the corresponding subsets of records that serve as evidence to the user. The process of identifying these insight blocks is challenging due to (i) the absence of labeled anomalies, (ii) the exponential size of the subset search space, and (iii) the complex dependencies among attributes, which obscure the true sources of anomalies. Simple frequency-based methods fail to capture these dependencies, leading to inaccurate results. To address this, we introduce Tab-Shapley, a cooperative game theory based framework that uses Shapley values to quantify the contribution of each attribute to the data's anomalous nature. While calculating Shapley values typically requires exponential time, we show that our game admits a closed-form solution, making the computation efficient. We validate the effectiveness of our approach through empirical analysis on real-world tabular datasets with ground-truth anomaly labels.

Problem

Research questions and friction points this paper is trying to address.

Data Quality

Automated Detection

Complex Data Relationships

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tab-Shapley

Shapley Values

Efficient Computation

🔎 Similar Papers

No similar papers found.