🤖 AI Summary
This study systematically investigates the mechanistic impact of data quality on machine learning performance for tabular data. Focusing on four core data defects (missingness, noise, inconsistency, and redundancy), we propose a unified framework that decouples multidimensional quality issues and introduce an interpretable quality–performance sensitivity analysis. Using both synthetic contamination and real-world degradation experiments, we evaluate 19 algorithms, including XGBoost, MLP, and TabTransformer, across classification, regression, and clustering tasks. Causal diagnosis is performed via SHAP-based attribution and root-cause error tracing. Key findings: label noise and feature missingness exert the strongest negative effects; high-quality preprocessing improves average downstream F1 by 12.3%; and we establish, for the first time, quantitative relationships between data-quality thresholds and model robustness, providing an empirical foundation for data governance and algorithm selection.
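The synthetic-contamination experiments mentioned above can be sketched in a minimal form. The summary does not specify the injection procedure, so the function names (`inject_label_noise`, `inject_missingness`) and the NumPy-based implementation below are illustrative assumptions: label noise is simulated by flipping a fixed fraction of labels to a different random class, and missingness by blanking cells completely at random (MCAR).

```python
import numpy as np

def inject_label_noise(y, rate, n_classes, rng):
    """Flip a fraction `rate` of labels to a different random class.

    Adding a random shift in [1, n_classes-1] modulo n_classes
    guarantees every selected label actually changes.
    """
    y = y.copy()
    n_flip = int(rate * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    shift = rng.integers(1, n_classes, size=n_flip)
    y[idx] = (y[idx] + shift) % n_classes
    return y

def inject_missingness(X, rate, rng):
    """Set a fraction `rate` of feature cells to NaN (MCAR)."""
    X = X.astype(float).copy()
    mask = rng.random(X.shape) < rate
    X[mask] = np.nan
    return X

# Example: contaminate a toy tabular dataset at two severity levels.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=1000)
X = rng.normal(size=(1000, 10))

y_noisy = inject_label_noise(y, rate=0.2, n_classes=3, rng=rng)
X_miss = inject_missingness(X, rate=0.1, rng=rng)
```

Sweeping `rate` over a grid and re-training each model at every level is one straightforward way to trace out the quality–performance sensitivity curves the study describes.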