A Unified Framework for Inference with General Missingness Patterns and Machine Learning Imputation

📅 2025-08-20

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

Existing missing-data imputation methods predominantly assume missingness completely at random (MCAR), rendering them inadequate for more realistic missing-at-random (MAR) mechanisms and complex missingness patterns. Method: We propose the first unified inference framework for Z-estimation under arbitrary missingness mechanisms, leveraging general-purpose machine learning imputation. Our approach hierarchically models missingness patterns and jointly models masking and imputation to construct a weighted aggregated estimator compatible with mainstream weighted inference tools. Contribution/Results: We establish asymptotic normality of the estimator and rigorously prove its statistical efficiency dominates that of weighted complete-case analysis. Extensive simulations demonstrate the method’s effectiveness, robustness, and superiority across diverse MAR settings, including those with heterogeneous missingness probabilities and high-dimensional covariates.

Technology Category

Application Category

📝 Abstract

Pre-trained machine learning (ML) predictions have been increasingly used to complement incomplete data to enable downstream scientific inquiries, but their naive integration risks biased inferences. Recently, multiple methods have been developed to provide valid inference with ML imputations regardless of prediction quality and to enhance efficiency relative to complete-case analyses. However, existing approaches are often limited to missing outcomes under a missing-completely-at-random (MCAR) assumption, failing to handle general missingness patterns under the more realistic missing-at-random (MAR) assumption. This paper develops a novel method which delivers valid statistical inference framework for general Z-estimation problems using ML imputations under the MAR assumption and for general missingness patterns. The core technical idea is to stratify observations by distinct missingness patterns and construct an estimator by appropriately weighting and aggregating pattern-specific information through a masking-and-imputation procedure on the complete cases. We provide theoretical guarantees of asymptotic normality of the proposed estimator and efficiency dominance over weighted complete-case analyses. Practically, the method affords simple implementations by leveraging existing weighted complete-case analysis software. Extensive simulations are carried out to validate theoretical results. The paper concludes with a brief discussion on practical implications, limitations, and potential future directions.

Problem

Research questions and friction points this paper is trying to address.

Handling general missingness patterns under MAR assumption

Providing valid inference with ML imputations for Z-estimation

Addressing bias risks from naive integration of ML predictions

Innovation

Methods, ideas, or system contributions that make the work stand out.

MAR assumption for general missingness patterns

Stratification and weighting by missingness patterns

Masking-and-imputation procedure on complete cases

🔎 Similar Papers

Learnable Prompt as Pseudo-Imputation: Rethinking the Necessity of Traditional EHR Data Imputation in Downstream Clinical Prediction