Learning Dependency Models for Subset Repair

📅 2025-12-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Inconsistent values are pervasive in real-world data. Existing repair methods that seek a minimum removal set often admit more than one solution, which hinders decision-making, and guidance based on the most frequent value (the mode) frequently leads to erroneous repairs. Method: This paper formally defines the “optimal subset repair under attribute dependencies” problem and proposes a unified repair framework integrating functional dependencies (FDs) and multivalued dependencies (MVDs). We design an approximation algorithm that combines clique-structure detection with linear programming (LP) relaxation, and develop a probabilistic repair model with theoretically guaranteed error bounds. Contribution/Results: Our approach mitigates bias toward high-frequency values, improving repair plausibility and correctness. Experiments on real-world datasets demonstrate significantly higher repair accuracy than state-of-the-art baselines. Downstream tasks, including classification and clustering, also achieve substantial performance gains when run on the cleaned data.

📝 Abstract

Inconsistent values are common in real-world applications and can negatively affect data analysis and decision-making. Existing research primarily focuses on identifying the smallest removal set that resolves the inconsistencies, but recent studies have shown that multiple minimum removal sets may exist, which makes it difficult to choose among them. Some approaches use the most frequent values to guide the subset repair, but this strategy has been criticized because it can misidentify errors. To address these issues, we consider the dependencies between attribute values to determine a more appropriate subset repair. Our main contributions include (1) formalizing the optimal subset repair problem with attribute dependencies and analyzing its computational hardness; (2) computing the exact solution using integer linear programming; (3) developing an approximate algorithm with performance guarantees based on cliques and LP relaxation; and (4) designing a probabilistic approach with an approximation bound for efficiency. Experimental results on real-world datasets validate the effectiveness of our methods in both subset repair performance and downstream applications.
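
To make the non-uniqueness point in the abstract concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes a single functional dependency (zip determines city), a toy three-row table, and a brute-force search, and the names `fd_conflicts` and `minimum_removal_sets` are hypothetical helpers introduced only for illustration.

```python
from itertools import combinations


def fd_conflicts(rows, lhs, rhs):
    """Return index pairs of rows that agree on lhs but differ on rhs."""
    return [
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if all(rows[i][a] == rows[j][a] for a in lhs)
        and any(rows[i][a] != rows[j][a] for a in rhs)
    ]


def minimum_removal_sets(rows, lhs, rhs):
    """Brute-force every smallest set of row indices whose removal
    leaves no conflicting pair (exponential; toy inputs only)."""
    conflicts = fd_conflicts(rows, lhs, rhs)
    for size in range(len(rows) + 1):
        found = [
            set(removed)
            for removed in combinations(range(len(rows)), size)
            if all(i in removed or j in removed for i, j in conflicts)
        ]
        if found:
            return found
    return []


# Hypothetical toy data: rows 0 and 1 violate zip -> city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]
repairs = minimum_removal_sets(rows, lhs=["zip"], rhs=["city"])
print(repairs)  # [{0}, {1}]
```

Both `{0}` and `{1}` are minimum removal sets here, and nothing in the size criterion alone says which row is the erroneous one. This is the ambiguity that, according to the abstract, the paper resolves by taking dependencies between attribute values into account.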

Problem

Research questions and friction points this paper is trying to address.

Addresses inconsistencies in real-world data affecting analysis and decisions

Focuses on subset repair using attribute dependencies for accuracy

Develops exact and approximate algorithms for efficient inconsistency resolution

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using attribute dependencies for subset repair

Exact solution via integer linear programming

Approximate algorithm with performance guarantees
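
For orientation, the textbook integer linear program for minimum subset repair over a conflict graph is sketched below. This is the standard weighted vertex-cover formulation, given as an assumption about the general setting; the paper's own objective, which accounts for attribute dependencies, may differ. The symbols are introduced here for illustration: $x_i = 1$ means tuple $t_i$ is removed, $w_i$ is an assumed removal cost, and $E$ is the set of conflicting tuple pairs.

```latex
\begin{align}
\min_{x} \quad & \sum_{i=1}^{n} w_i x_i \\
\text{s.t.} \quad & x_i + x_j \ge 1 && \forall\, (t_i, t_j) \in E, \\
& x_i \in \{0, 1\} && i = 1, \dots, n.
\end{align}
```

The LP relaxation replaces $x_i \in \{0, 1\}$ with $0 \le x_i \le 1$. For this plain vertex-cover form, removing every tuple with $x_i \ge 1/2$ yields a feasible removal set whose cost is at most twice the LP optimum; the guarantee of the paper's clique-based algorithm is not stated in this summary and may be different.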
