🤖 AI Summary
Existing variable selection methods struggle to control the false discovery rate (FDR) under missing data, while model-X knockoffs—though theoretically guaranteed to control FDR—cannot directly accommodate missing values. This work establishes, for the first time, the theoretical FDR controllability of knockoffs in the presence of missing data. We propose three novel paradigms: (i) posterior sampling-based imputation and knockoff reuse, (ii) knockoff generation restricted to observed variables only, and (iii) joint latent-variable imputation and knockoff construction. Our approaches integrate Bayesian posterior sampling, univariate imputation, and latent-variable modeling, and we rigorously prove that they satisfy FDR ≤ α under standard assumptions. Extensive experiments demonstrate precise FDR control across diverse missingness mechanisms (MCAR, MAR, MNAR), variable correlation structures, and sample sizes, while achieving high statistical power and substantially reduced computational complexity compared to existing alternatives.
📝 Abstract
One limitation of the most statistical/machine learning-based variable selection approaches is their inability to control the false selections. A recently introduced framework, model-x knockoffs, provides that to a wide range of models but lacks support for datasets with missing values. In this work, we discuss ways of preserving the theoretical guarantees of the model-x framework in the missing data setting. First, we prove that posterior sampled imputation allows reusing existing knockoff samplers in the presence of missing values. Second, we show that sampling knockoffs only for the observed variables and applying univariate imputation also preserves the false selection guarantees. Third, for the special case of latent variable models, we demonstrate how jointly imputing and sampling knockoffs can reduce the computational complexity. We have verified the theoretical findings with two different exploratory variable distributions and investigated how the missing data pattern, amount of correlation, the number of observations, and missing values affected the statistical power.