Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables

📅 2025-04-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address a key bottleneck in tabular data cleaning, namely the reliance on domain experts to manually specify data-quality constraints, this paper proposes Semantic-Domain Constraints (SDCs): a fully automated, cross-domain-generalizable framework for error detection and correction that requires no human annotation. Methodologically, the authors formally define generalizable SDCs and introduce an unsupervised learning framework that combines large-scale statistical hypothesis testing, constraint-set optimization via distillation, and table-structure-aware semantic modeling, aiming at both theoretical soundness and practical efficacy. Experiments across 2,400 real-world columns demonstrate that the approach achieves high-precision error detection on its own and significantly boosts the performance of existing expert-driven methods. The authors also release the first benchmark dataset dedicated to semantic-constraint learning, along with open-source code.

📝 Abstract
Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain experts to first manually specify data-quality constraints for a given table before data-cleaning algorithms can be applied. In this work, we propose a new class of data-quality constraints that we call Semantic-Domain Constraints, which can be reliably inferred and automatically applied to any table, without requiring domain experts to specify them on a per-table basis. We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests; these constraints can further be distilled into a core set using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used to both (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a new class of complementary constraints. Our extensively labeled benchmark dataset with 2,400 real data columns, as well as our code, are available at https://github.com/qixuchen/AutoTest to facilitate future research.
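To make the idea concrete, here is a minimal sketch of how a semantic-domain constraint might flag out-of-domain values in a column. The domain set, function names, and the 80% coverage threshold are all illustrative assumptions, not the paper's actual learned constraints, which are far richer.

```python
# Hypothetical semantic-domain constraint: if most of a column's values
# fall inside a known semantic domain, the remaining stragglers are
# likely errors. Domain and threshold here are illustrative assumptions.

MONTH_DOMAIN = {"jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"}

def detect_out_of_domain(column, domain, min_coverage=0.8):
    """Flag values outside `domain` when most of the column is inside it."""
    values = [v.strip().lower() for v in column]
    in_domain = [v in domain for v in values]
    coverage = sum(in_domain) / len(values)
    if coverage < min_coverage:
        return []  # constraint does not apply to this column
    return [column[i] for i, inside in enumerate(in_domain) if not inside]

col = ["Jan", "Feb", "Mar", "Apx", "May"]
print(detect_out_of_domain(col, MONTH_DOMAIN))  # -> ['Apx']
```

Note that the constraint abstains when coverage is low, which is what lets it apply safely to arbitrary tables without per-table tuning.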
Problem

Research questions and friction points this paper is trying to address.

Detect data errors in tables without manual domain-expert input
Learn semantic-domain constraints automatically from table corpora
Augment existing data-cleaning techniques with complementary constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically infer Semantic-Domain Constraints for tables
Learn constraints using large-scale statistical tests
Optimize constraints with provable quality guarantees
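The bullets above can be sketched with a toy selection rule: keep only candidate constraints whose false-positive rate on a corpus of (presumed clean) columns is provably low at a chosen confidence level. The Hoeffding bound, the counts, and the constraint names are illustrative assumptions standing in for the paper's actual statistical tests and optimization framework.

```python
import math

def fp_upper_bound(false_positives, trials, delta=0.05):
    """One-sided Hoeffding upper confidence bound on the true FP rate."""
    phat = false_positives / trials
    return phat + math.sqrt(math.log(1 / delta) / (2 * trials))

def select_constraints(candidates, max_fp=0.02, delta=0.05):
    """Keep candidates whose FP rate is below max_fp at confidence 1 - delta.

    `candidates` maps a constraint name to (false_positives, columns_tested).
    """
    return {name for name, (fp, n) in candidates.items()
            if fp_upper_bound(fp, n, delta) <= max_fp}

corpus_stats = {                  # hypothetical counts from a table corpus
    "month-names": (0, 10000),    # never fired on a clean column
    "us-states":   (3, 10000),
    "loose-regex": (400, 10000),  # fires too often; rejected
}
print(select_constraints(corpus_stats))  # -> {'month-names', 'us-states'}
```

The same flavor of bound is what allows the selected constraint set to come with a provable quality guarantee rather than just an empirical error rate.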