🤖 AI Summary
To address the challenge of balancing efficiency and model performance in data cleaning under resource constraints, this paper proposes a progressive cleaning optimization framework designed to maximize machine learning effectiveness. The method integrates error sensitivity analysis, incremental model evaluation, and a greedy selection strategy to recommend, at each iteration, the feature whose cleaning is expected to benefit the model most. It adapts to multiple ML algorithms and diverse error types, overcoming limitations of static cleaning pipelines and heuristics based solely on feature importance. Experiments across multiple real-world datasets and mainstream ML models demonstrate an average prediction accuracy improvement of 5 percentage points, with gains of up to 52 percentage points, significantly outperforming existing baselines. The core contribution is the first formulation of cleaning decisions as a sequence optimization problem explicitly targeting end-to-end model performance gain, enabling scalable, interpretable, and real-time cleaning recommendations.
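The greedy selection strategy described above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual API: the function and feature names are hypothetical, and `score` stands in for whatever incremental model evaluation COMET performs after tentatively cleaning a feature.

```python
# Hedged sketch of a COMET-style greedy cleaning loop (illustrative names,
# not the paper's implementation). Each round tentatively cleans every
# remaining dirty feature, scores the resulting model, and commits the
# feature whose cleaning yields the largest accuracy gain.

def greedy_clean(dirty_features, score, budget):
    """Return the cleaning order chosen greedily under a step budget.

    dirty_features: set of feature names still containing errors
    score(cleaned): model accuracy after cleaning the features in `cleaned`
    budget: maximum number of features we can afford to clean
    """
    cleaned, order = set(), []
    best = score(cleaned)
    for _ in range(min(budget, len(dirty_features))):
        # Marginal gain of cleaning each remaining candidate next.
        gains = {f: score(cleaned | {f}) - best
                 for f in dirty_features - cleaned}
        nxt, gain = max(gains.items(), key=lambda kv: kv[1])
        if gain <= 0:  # stop early once no feature improves the model
            break
        cleaned.add(nxt)
        order.append(nxt)
        best += gain
    return order

# Toy stand-in: pretend accuracy depends additively on which features are clean.
toy_gain = {"age": 0.10, "income": 0.04, "zip": 0.0}
toy_score = lambda cleaned: 0.70 + sum(toy_gain[f] for f in cleaned)
print(greedy_clean(set(toy_gain), toy_score, budget=3))  # → ['age', 'income']
```

In a real pipeline, each call to `score` would retrain or incrementally update the ML model, which is why COMET's per-iteration recommendation matters: the budget is spent only on features that actually move end-to-end accuracy.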
📝 Abstract
Data quality is crucial in machine learning (ML) applications, as errors in the data can significantly impact the prediction accuracy of the underlying ML model. Therefore, data cleaning is an integral component of any ML pipeline. However, in practical scenarios, data cleaning incurs significant costs, as it often involves domain experts for configuring and executing the cleaning process. Thus, efficient resource allocation during data cleaning can enhance ML prediction accuracy while controlling expenses. This paper presents COMET, a system designed to optimize data cleaning efforts for ML tasks. COMET gives step-by-step recommendations on which feature to clean next, maximizing the efficiency of data cleaning under resource constraints. We evaluated COMET across various datasets, ML algorithms, and data error types, demonstrating its robustness and adaptability. Our results show that COMET consistently outperforms feature importance-based and random baselines, as well as another well-known cleaning method, achieving up to 52 and on average 5 percentage points higher ML prediction accuracy than these baselines.