The effects of data quality on machine learning performance on tabular data

📅 2022-07-29
🏛️ Information Systems
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically investigates the mechanistic impact of data quality on machine learning performance for tabular data. Addressing four core data defects—missingness, noise, inconsistency, and redundancy—we propose a unified framework that decouples multidimensional quality issues and introduce an interpretable quality–performance sensitivity analysis. Leveraging both synthetic contamination and real-world degradation experiments, we evaluate 19 algorithms—including XGBoost, MLP, and TabTransformer—across classification, regression, and clustering tasks. Causal diagnosis is enabled via SHAP-based attribution and error溯源. Key findings: label noise and feature missingness exert the strongest negative effects; high-quality preprocessing improves average downstream F1 by 12.3%; and we establish, for the first time, quantitative relationships between data quality thresholds and model robustness—providing empirical foundations for data governance and algorithm selection.
Problem

Research questions and friction points this paper is trying to address.

Examining impact of data quality on ML algorithm performance
Analyzing six data quality dimensions across 19 ML algorithms
Investigating pollution effects in training, test, or both datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explores data quality impact on ML performance
Tests 19 algorithms across classification, regression, clustering
Analyzes three pollution scenarios in AI pipeline
🔎 Similar Papers
No similar papers found.