The effects of data quality on machine learning performance on tabular data

📅 2022-07-29
🏛️ Information Systems
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically investigates the mechanistic impact of data quality on machine learning performance for tabular data. Addressing four core data defects—missingness, noise, inconsistency, and redundancy—we propose a unified framework that decouples multidimensional quality issues and introduce an interpretable quality–performance sensitivity analysis. Leveraging both synthetic contamination and real-world degradation experiments, we evaluate 19 algorithms—including XGBoost, MLP, and TabTransformer—across classification, regression, and clustering tasks. Causal diagnosis is enabled via SHAP-based attribution and error溯源. Key findings: label noise and feature missingness exert the strongest negative effects; high-quality preprocessing improves average downstream F1 by 12.3%; and we establish, for the first time, quantitative relationships between data quality thresholds and model robustness—providing empirical foundations for data governance and algorithm selection.
Problem

Research questions and friction points this paper is trying to address.

Examining impact of data quality on ML algorithm performance
Analyzing six data quality dimensions across 19 ML algorithms
Investigating pollution effects in training, test, or both datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explores data quality impact on ML performance
Tests 19 algorithms across classification, regression, clustering
Analyzes three pollution scenarios in AI pipeline
🔎 Similar Papers
No similar papers found.
S
Sedir Mohammed
Hasso Plattner Institute, Germany
L
Lukas Budach
Hasso Plattner Institute, Germany
M
Moritz Feuerpfeil
Hasso Plattner Institute, Germany
N
Nina Ihde
Hasso Plattner Institute, Germany
A
Andrea Nathansen
Hasso Plattner Institute, Germany
N
N. Noack
Hasso Plattner Institute, Germany
H
Hendrik Patzlaff
Hasso Plattner Institute, Germany
Felix Naumann
Felix Naumann
Hasso Plattner Institute, University of Potsdam
Data ProfilingData IntegrationData CleaningData QualityData Preparation
Hazar Harmouch
Hazar Harmouch
University of Amsterdam
Data QualityData CleaningData IntegrationData-Centric AIResponsible Data Management