Imputation for prediction: beware of diminishing returns

📅 2024-07-29

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0


🤖 AI Summary

The prevailing assumption that higher imputation accuracy necessarily improves downstream predictive performance lacks empirical validation. Method: We systematically evaluate the impact of imputation accuracy on prediction across 19 real and synthetic datasets, testing 12 imputation methods (e.g., mean, KNN, MICE, GAIN) combined with diverse linear and nonlinear predictors (e.g., XGBoost, MLP). Contribution/Results: With expressive predictive models, gains in imputation accuracy yield only minor improvements in final prediction performance. Missingness indicators improve prediction, even under MCAR, and further reduce the importance of imputation accuracy. Imputation accuracy matters much more for generated linear outcomes than for real-data outcomes. These findings challenge the conventional “imputation-first” paradigm and favor a prediction-oriented approach to missing-data handling, in which preserving task-relevant signal matters more than the fidelity of imputed values.

📝 Abstract

Missing values are prevalent across various fields, posing challenges for training and deploying predictive models. In this context, imputation is a common practice, driven by the hope that accurate imputations will enhance predictions. However, recent theoretical and empirical studies indicate that simple constant imputation can be consistent and competitive. This empirical study aims to clarify if and when investing in advanced imputation methods yields significantly better predictions. Relating imputation and predictive accuracies across combinations of imputation and predictive models on 19 datasets, we show that imputation accuracy matters less i) when using expressive models and ii) when incorporating missingness indicators as complementary inputs, and iii) that it matters much more for generated linear outcomes than for real-data outcomes. Interestingly, we also show that the use of the missingness indicator is beneficial to the prediction performance, even in MCAR scenarios. Overall, on real data with powerful models, improving imputation only has a minor effect on prediction performance. Thus, investing in better imputations for improved predictions often offers limited benefits.

Problem

Research questions and friction points this paper is trying to address.

Investigates impact of advanced imputation on predictions.

Explores conditions favoring simple over complex imputation.

Assesses imputation's role in real-data predictive accuracy.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Simple constant imputation competes effectively.

Missingness indicators enhance prediction accuracy.

Expressive models reduce imputation importance.
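
To make the two ingredients above concrete, here is a minimal sketch of constant (mean) imputation combined with a missingness indicator. It is an illustration only, not the paper's pipeline: the function name `impute_with_indicator` and the toy arrays are hypothetical, and it assumes every column has at least one observed value in the training set.

```python
import numpy as np

def impute_with_indicator(X_train, X_test):
    """Mean-impute with training statistics and append a binary missingness mask.

    Hypothetical helper for illustration; assumes each training column
    contains at least one observed (non-NaN) value.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)

    # Column means computed on the observed training entries only.
    col_means = np.nanmean(X_train, axis=0)

    def transform(X):
        mask = np.isnan(X)                        # True where a value is missing
        X_imputed = np.where(mask, col_means, X)  # fill gaps with training means
        # Concatenate imputed features with the 0/1 indicator columns.
        return np.hstack([X_imputed, mask.astype(float)])

    return transform(X_train), transform(X_test)

# Toy data: NaN marks a missing entry.
X_train = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
X_test = np.array([[np.nan, 5.0]])

Z_train, Z_test = impute_with_indicator(X_train, X_test)
print(Z_train)
print(Z_test)
```

The resulting matrices have twice as many columns as the input: the imputed features followed by one indicator per feature. A downstream model can then use the indicator columns to tell imputed entries apart from observed ones, which is the kind of complementary input the points above refer to.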

🔎 Similar Papers

Learnable Prompt as Pseudo-Imputation: Rethinking the Necessity of Traditional EHR Data Imputation in Downstream Clinical Prediction