Iterative Data Curation with Theoretical Guarantees

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual verification of large-scale dynamic datasets is infeasible, leading to challenges in ensuring data accuracy. Method: This paper proposes a theory-driven, iterative data cleaning framework integrating error detection, automated correction, and progressive optimization. It formally models the iterative process, conducts accuracy testing, and performs probabilistic convergence analysis—validated via simulations and real-world case studies. Contribution/Results: The framework establishes the first rigorous theoretical guarantee of *probabilistic convergence to zero errors* for iterative data cleaning and proves that error detection accelerates error decay. Empirical results demonstrate that it significantly outperforms baseline methods in accuracy improvement and progressively approaches a fully correct dataset state, thereby unifying theoretical soundness with practical efficacy.

📝 Abstract
In recent years, more and more large data sets have become available. Data accuracy, the absence of verifiable errors in the data, is crucial for these large collections to enable high-quality research, downstream applications, and model training. This raises the problem of how to curate such large and growing data sets and improve their accuracy, especially when they are too large for manual curation to be feasible. This paper presents a unified procedure for the iterative and continuous improvement of data sets. We provide theoretical guarantees that data accuracy tests speed up error reduction and, most importantly, that the proposed approach will, asymptotically, eliminate all errors in the data with probability one. We corroborate the theoretical results with simulations and a real-world use case.
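The iterative procedure the abstract describes can be sketched as a toy simulation (all function names and parameter values here are illustrative assumptions, not from the paper): each round, an error detector flags a fraction of the remaining errors and an automated fixer corrects most of the flagged ones, so the error count decays toward zero.

```python
import random

def iterative_clean(n_records=10_000, error_rate=0.2,
                    detect_recall=0.7, fix_success=0.9,
                    rounds=30, seed=0):
    """Toy model of iterative curation: per round, each remaining error
    is detected with probability detect_recall and, once detected,
    corrected with probability fix_success."""
    rng = random.Random(seed)
    errors = round(n_records * error_rate)
    history = [errors]
    for _ in range(rounds):
        detected = sum(rng.random() < detect_recall for _ in range(errors))
        fixed = sum(rng.random() < fix_success for _ in range(detected))
        errors -= fixed
        history.append(errors)
    return history

history = iterative_clean()
# the per-round error count is non-increasing and, under these settings,
# reaches zero well before the last round
```

Each surviving error persists a round with probability (1 − recall × fix_success), so the error count shrinks geometrically, which is the intuition behind the paper's convergence-to-zero-errors guarantee.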
Problem

Research questions and friction points this paper is trying to address.

Developing automated methods to improve accuracy in large datasets
Providing theoretical guarantees for error reduction through iterative curation
Ensuring asymptotic elimination of all data errors with probability one
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative data curation procedure for continuous improvement
Theoretical guarantees for asymptotic error elimination
Validation through simulations and real-world use case
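The claim that accuracy tests speed up error reduction has a simple expected-value reading, sketched below with illustrative parameter values that are assumptions, not figures from the paper: if each remaining error is independently found and fixed in a given round with probability p, the expected error count after t rounds is e0 · (1 − p)^t, which tends to zero, and a larger p (a better accuracy test) gives faster decay.

```python
def expected_errors(e0: float, p: float, t: int) -> float:
    """Expected remaining errors after t rounds when each error is
    independently found and fixed with per-round probability p."""
    return e0 * (1 - p) ** t

weak = expected_errors(1000, 0.2, 20)    # weaker accuracy test
strong = expected_errors(1000, 0.6, 20)  # stronger accuracy test
# the stronger test leaves far fewer expected errors after the same
# number of rounds, and both tend to zero as t grows
```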
👥 Authors
Väinö Yrjänäinen, Department of Statistics, Uppsala University
Johan Jonasson, Department of Mathematical Sciences, Chalmers University of Technology
Måns Magnusson, Department of Statistics, Uppsala University, Sweden

Bayesian Statistics · Probabilistic Machine Learning · Text-as-Data · Computational Social Science