Blockwise Missingness meets AI: A Tractable Solution for Semiparametric Inference

📅 2025-09-28

🤖 AI Summary

Problem: Parameter estimation and statistical inference are difficult when data are missing in a blockwise, non-monotone pattern. Method: This paper proposes a semiparametric approach that incorporates pretrained AI models under a missing-completely-at-random mechanism. Its core components are (i) an approximation of the optimal estimating equation based on the Restricted Anova hierarchY (RAY); (ii) a prediction-powered-inference-style estimator, IBM(RAY), whose model-specific hyperparameters are tuned to control asymptotic variance; and (iii) IBM(Adaptive), the most efficient member of a broader class of estimators, found by solving a constrained quadratic program. Contribution/Results: The authors show that all IBM estimators are unbiased and asymptotically achieve guaranteed efficiency gains over a naive complete-case estimator, regardless of the predictive accuracy of the AI models used. Simulation studies and an application to surface protein abundance estimation demonstrate finite-sample performance and numerical stability.

📝 Abstract

We consider parameter estimation and inference when data feature blockwise, non-monotone missingness. Our approach, rooted in semiparametric theory and inspired by prediction-powered inference, leverages off-the-shelf AI (predictive or generative) models to handle missing-completely-at-random mechanisms by approximating the optimal estimating equation through a novel and tractable Restricted Anova hierarchY (RAY) approximation. The resulting Inference for Blockwise Missingness (RAY) estimator, or IBM(RAY), incorporates pre-trained AI models and carefully controls asymptotic variance by tuning model-specific hyperparameters. We then extend IBM(RAY) to a general class of estimators. We find the most efficient estimator in this class, which we call IBM(Adaptive), by solving a constrained quadratic programming problem. All IBM estimators are unbiased and, crucially, asymptotically achieve guaranteed efficiency gains over a naive complete-case estimator, regardless of the predictive accuracy of the AI models used. We demonstrate the finite-sample performance and numerical stability of our method through simulation studies and an application to surface protein abundance estimation.
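
The idea that a tuned weight on AI predictions can only help, never hurt, relative to a complete-case estimator can be illustrated in a much simpler setting than the one the paper treats. The sketch below is not the IBM(RAY) or IBM(Adaptive) estimator. It assumes a single outcome that is missing completely at random, a scalar mean as the target, and a hypothetical pretrained predictor `f`. The function name `tuned_pp_mean` and all simulation settings are illustrative assumptions.

```python
import numpy as np

def tuned_pp_mean(y_obs, f_obs, f_all):
    """Prediction-augmented mean with a tuned weight (illustrative only).

    y_obs: outcomes on the n complete cases
    f_obs: predictions on the same n complete cases
    f_all: predictions on all N units (complete and incomplete)
    """
    y_obs = np.asarray(y_obs, dtype=float)
    f_obs = np.asarray(f_obs, dtype=float)
    f_all = np.asarray(f_all, dtype=float)
    n, N = y_obs.size, f_all.size

    # Variance-minimizing weight for this simple estimator: cov(Y, f) / var(f).
    var_f = np.var(f_obs, ddof=1)
    lam = 0.0 if var_f == 0 else np.cov(y_obs, f_obs, ddof=1)[0, 1] / var_f

    # Complete-case mean, corrected by a mean-zero term built from predictions.
    estimate = y_obs.mean() - lam * (f_obs.mean() - f_all.mean())

    # Plug-in variance estimates: complete-case versus tuned estimator.
    var_cc = np.var(y_obs, ddof=1) / n
    var_tuned = var_cc - (1 - n / N) * lam**2 * var_f / n
    return estimate, lam, var_cc, var_tuned

# Simulated data (all settings below are assumptions for illustration).
rng = np.random.default_rng(0)
N, n = 5000, 500
x = rng.normal(size=N)
y = 2.0 + x + rng.normal(scale=0.5, size=N)
f = 0.8 * x + 0.3                   # hypothetical, deliberately biased predictor
observed = rng.permutation(N) < n   # MCAR mask with exactly n observed outcomes

est, lam, var_cc, var_tuned = tuned_pp_mean(y[observed], f[observed], f)
print(f"estimate={est:.3f}  lambda={lam:.3f}")
print(f"complete-case variance={var_cc:.6f}  tuned variance={var_tuned:.6f}")
```

Because the correction term has mean zero under missingness completely at random, the predictor's bias does not bias the estimate, and the tuned weight shrinks toward zero when predictions are uninformative, so the plug-in variance never exceeds the complete-case variance. The paper's setting, with several blocks of variables missing in non-monotone patterns, requires the RAY approximation and a constrained quadratic program rather than this single scalar weight.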

Problem

Research questions and friction points this paper is trying to address.

Estimating parameters with blockwise non-monotone missing data

Leveraging AI models under missing-completely-at-random mechanisms

Achieving unbiased estimation with guaranteed asymptotic efficiency gains

Innovation

Methods, ideas, or system contributions that make the work stand out.

RAY approximation for optimal estimating equations

Leveraging pre-trained AI models for missing data

Constrained optimization for efficient estimator selection

🔎 Similar Papers

Learnable Prompt as Pseudo-Imputation: Rethinking the Necessity of Traditional EHR Data Imputation in Downstream Clinical Prediction