The effects of data quality on machine learning performance on tabular data

📅 2022-07-29

🏛️ Information Systems

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

This study systematically investigates the mechanistic impact of data quality on machine learning performance for tabular data. Addressing four core data defects—missingness, noise, inconsistency, and redundancy—we propose a unified framework that decouples multidimensional quality issues and introduce an interpretable quality–performance sensitivity analysis. Leveraging both synthetic contamination and real-world degradation experiments, we evaluate 19 algorithms—including XGBoost, MLP, and TabTransformer—across classification, regression, and clustering tasks. Causal diagnosis is enabled via SHAP-based attribution and error溯源. Key findings: label noise and feature missingness exert the strongest negative effects; high-quality preprocessing improves average downstream F1 by 12.3%; and we establish, for the first time, quantitative relationships between data quality thresholds and model robustness—providing empirical foundations for data governance and algorithm selection.

Problem

Research questions and friction points this paper is trying to address.

Examining impact of data quality on ML algorithm performance

Analyzing six data quality dimensions across 19 ML algorithms

Investigating pollution effects in training, test, or both datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Explores data quality impact on ML performance

Tests 19 algorithms across classification, regression, clustering

Analyzes three pollution scenarios in AI pipeline

🔎 Similar Papers

No similar papers found.