🤖 AI Summary
Problem: Manual verification of large-scale, dynamic datasets is infeasible, making it difficult to ensure data accuracy.
Method: This paper proposes a theory-driven, iterative data-cleaning framework that integrates error detection, automated correction, and progressive optimization. It formally models the iterative process, incorporates data accuracy testing, and provides a probabilistic convergence analysis, validated through simulations and a real-world case study.
Contribution/Results: The framework establishes the first rigorous theoretical guarantee of *probabilistic convergence to zero errors* for iterative data cleaning and proves that data accuracy tests accelerate error decay. Empirical results show that it significantly outperforms baseline methods in accuracy improvement and progressively approaches a fully correct dataset, unifying theoretical soundness with practical efficacy.
📝 Abstract
In recent years, ever larger data sets have become available. Data accuracy, the absence of verifiable errors in the data, is crucial if these data sets are to support high-quality research, downstream applications, and model training. This raises the problem of how to curate and improve data accuracy in large and growing data sets, especially when the data is too large for manual curation to be feasible. This paper presents a unified procedure for the iterative and continuous improvement of data sets. We provide theoretical guarantees that data accuracy tests speed up error reduction and, most importantly, that the proposed approach asymptotically eliminates all errors in the data with probability one. We corroborate the theoretical results with simulations and a real-world use case.
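To make the convergence claims concrete, the following is a minimal simulation sketch in the spirit of the analysis described above, not the paper's actual algorithm: it assumes each remaining error is independently flagged by an accuracy test with probability `p_detect` and, once flagged, corrected with probability `p_correct` in every iteration. Both parameters and the independence assumption are illustrative. Under this toy model the expected error count shrinks geometrically, and a higher detection rate accelerates the decay, mirroring the two theoretical results stated in the abstract.

```python
import random

def simulate_cleaning(n_errors, p_detect, p_correct, max_iters=200, seed=0):
    """Toy model of iterative cleaning (illustrative assumptions, not the paper's method):
    per iteration, each remaining error is independently flagged by an accuracy test
    with probability p_detect and, if flagged, fixed with probability p_correct.
    Returns the error count after each iteration."""
    rng = random.Random(seed)
    errors, history = n_errors, [n_errors]
    for _ in range(max_iters):
        if errors == 0:
            break
        # An error survives a round unless it is both detected and corrected.
        errors = sum(
            1 for _ in range(errors)
            if not (rng.random() < p_detect and rng.random() < p_correct)
        )
        history.append(errors)
    return history

# Stronger accuracy testing (higher p_detect) speeds up error decay;
# both runs eventually reach a zero-error state.
weak = simulate_cleaning(n_errors=1_000, p_detect=0.2, p_correct=0.9)
strong = simulate_cleaning(n_errors=1_000, p_detect=0.6, p_correct=0.9)
print(f"weak detection:   {len(weak) - 1} iterations, final errors = {weak[-1]}")
print(f"strong detection: {len(strong) - 1} iterations, final errors = {strong[-1]}")
```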