Missing Value Knockoffs

📅 2022-02-26
🏛️ arXiv.org
📈 Citations: 2
Influential: 1
🤖 AI Summary
Existing variable selection methods struggle to control the false discovery rate (FDR) under missing data, while model-X knockoffs—though theoretically guaranteed to control FDR—cannot directly accommodate missing values. This work establishes, for the first time, the theoretical FDR controllability of knockoffs in the presence of missing data. We propose three novel paradigms: (i) posterior sampling-based imputation and knockoff reuse, (ii) knockoff generation restricted to observed variables only, and (iii) joint latent-variable imputation and knockoff construction. Our approaches integrate Bayesian posterior sampling, univariate imputation, and latent-variable modeling, and we rigorously prove that they satisfy FDR ≤ α under standard assumptions. Extensive experiments demonstrate precise FDR control across diverse missingness mechanisms (MCAR, MAR, MNAR), variable correlation structures, and sample sizes, while achieving high statistical power and substantially reduced computational complexity compared to existing alternatives.
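The FDR guarantee these paradigms preserve comes from the standard knockoff filter: construct exchangeable knockoff copies, compute antisymmetric importance statistics, and select variables above the knockoff+ threshold. A minimal sketch (not the paper's code) for the simplest case of mutually independent Gaussian features, where fresh i.i.d. draws are valid knockoffs and marginal-correlation differences serve as statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 50, 0.2
beta = np.zeros(p)
beta[:10] = 3.5                       # first 10 variables are true signals
X = rng.standard_normal((n, p))       # mutually independent N(0, 1) features
y = X @ beta + rng.standard_normal(n)

# With mutually independent features, fresh i.i.d. draws from each
# marginal are valid model-X knockoffs.
X_tilde = rng.standard_normal((n, p))

# Antisymmetric importance statistics: marginal-correlation difference.
W = np.abs(X.T @ y) - np.abs(X_tilde.T @ y)

# Knockoff+ threshold: smallest t whose estimated FDP is at most q.
tau = np.inf
for t in np.sort(np.abs(W)):
    fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
    if fdp_hat <= q:
        tau = t
        break

selected = np.flatnonzero(W >= tau)   # indices passing the threshold
```

The paradigms in the paper modify how `X` and `X_tilde` are obtained under missingness; the thresholding step itself is unchanged.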
📝 Abstract
One limitation of most statistical and machine-learning-based variable selection approaches is their inability to control false selections. A recently introduced framework, model-X knockoffs, provides this control for a wide range of models but lacks support for datasets with missing values. In this work, we discuss ways of preserving the theoretical guarantees of the model-X framework in the missing data setting. First, we prove that posterior-sampled imputation allows reusing existing knockoff samplers in the presence of missing values. Second, we show that sampling knockoffs only for the observed variables and applying univariate imputation also preserves the false selection guarantees. Third, for the special case of latent variable models, we demonstrate how jointly imputing and sampling knockoffs can reduce the computational complexity. We verified the theoretical findings with two different explanatory variable distributions and investigated how the missing data pattern, the amount of correlation, the number of observations, and the number of missing values affected the statistical power.
Problem

Research questions and friction points this paper is trying to address.

Extending the model-X knockoffs framework to handle missing data
Preserving false selection guarantees with imputation methods
Reducing computational complexity for latent variable models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Posterior sampled imputation preserves knockoff guarantees
Sampling knockoffs only for observed variables with imputation
Joint imputation and knockoff sampling reduces computational complexity
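The first contribution, posterior-sampled imputation followed by an unchanged knockoff sampler, can be illustrated with a hedged sketch. It assumes jointly Gaussian features with known covariance and MCAR missingness; the helper name `posterior_impute` is ours, not the paper's, and the exact conditional-Gaussian draw stands in for whatever posterior sampler the feature model admits:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, rho = 200, 5, 0.5
# Equicorrelated Gaussian features with known covariance Sigma.
Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# MCAR: each entry is independently missing with probability 0.1.
mask = rng.random((n, p)) < 0.1
X_incomplete = np.where(mask, np.nan, X)

def posterior_impute(row, miss, Sigma, rng):
    """Draw the missing coordinates from the exact Gaussian conditional
    p(X_miss | X_obs) -- a posterior-sampling imputation."""
    obs = ~miss
    S_mo = Sigma[np.ix_(miss, obs)]
    S_oo_inv = np.linalg.inv(Sigma[np.ix_(obs, obs)])
    cond_mean = S_mo @ S_oo_inv @ row[obs]
    cond_cov = Sigma[np.ix_(miss, miss)] - S_mo @ S_oo_inv @ Sigma[np.ix_(obs, miss)]
    cond_cov = (cond_cov + cond_cov.T) / 2  # symmetrize against round-off
    return rng.multivariate_normal(cond_mean, cond_cov)

X_imputed = X_incomplete.copy()
for i in range(n):
    miss = mask[i]
    if miss.all():                    # fully missing row: draw from the prior
        X_imputed[i] = rng.multivariate_normal(np.zeros(p), Sigma)
    elif miss.any():
        X_imputed[i, miss] = posterior_impute(X_incomplete[i], miss, Sigma, rng)

# X_imputed can now be passed to any off-the-shelf Gaussian model-X
# knockoff sampler, unchanged -- the knockoff-reuse point above.
```

Because the imputed entries are exact draws from the conditional feature distribution, the completed matrix has the same joint law as fully observed data, which is what lets the downstream knockoff sampler be reused without modification.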