Virtual Dummies: Enabling Scalable FDR-Controlled Variable Selection via Sequential Sampling of Null Features

📅 2026-04-08
🤖 AI Summary
This work addresses the challenge of scaling false discovery rate (FDR) control to high-dimensional variable selection problems with millions of predictors, such as genome-wide association studies. The authors propose VD-T-Rex, which formalizes the information flow of forward selection and replaces the explicit generation of synthetic null variables (dummies) with an implicit sequential sampling mechanism. Compatible selectors see unselected dummies only through projections onto an adaptively evolving low-dimensional subspace; for rotationally invariant dummy distributions, an adaptive stick-breaking construction samples these projections from their exact conditional distribution given the selection history. Instantiated with the LARS algorithm as VD-LARS, the method preserves the exact selection law and FDR guarantees of the original T-Rex selector. A pathwise universality theorem shows that, under mild delocalization conditions, selection paths driven by generic standardized i.i.d. dummies converge to the same Gaussian limit. On realistic genome-wide association study data, VD-T-Rex reduces memory usage and runtime by several orders of magnitude while controlling FDR and retaining power at scales where all competing methods either fail or time out.
📝 Abstract
High-dimensional variable selection, particularly in genomics, requires error-controlling procedures that scale to millions of predictors. The Terminating-Random Experiments (T-Rex) selector achieves false discovery rate (FDR) control by aggregating results of early terminated random experiments, each combining original predictors with i.i.d. synthetic null variables (dummies). At biobank scales, however, explicit dummy augmentation requires terabytes of memory. We demonstrate that this bottleneck is not fundamental. Formalizing the information flow of forward selection through a filtration, we show that compatible selectors interact with unselected dummies solely through projections onto an adaptively evolving low-dimensional subspace. For rotationally invariant dummy distributions, we derive an adaptive stick-breaking construction sampling these projections from their exact conditional distribution given the selection history, thereby eliminating dummy matrix materialization. We prove a pathwise universality theorem: under mild delocalization conditions, selection paths driven by generic standardized i.i.d. dummies converge to the same Gaussian limit. We instantiate the theory through Virtual Dummy LARS (VD-LARS), reducing memory and runtime by several orders of magnitude while preserving the exact selection law and FDR guarantees of the T-Rex selector. Experiments on realistic genome-wide association study data confirm that VD-T-Rex controls FDR and achieves power at scales where all competing methods either fail or time out.
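The idea of sampling a dummy's projections one at a time, without ever storing the dummy itself, can be illustrated with a small sketch. The abstract does not spell out the construction, so the code below is not the paper's VD-LARS implementation. It relies only on a standard fact about a vector drawn uniformly from the unit sphere in R^n: given its first k coordinates in any orthonormal basis, the share of the remaining squared norm taken by the next coordinate follows a Beta(1/2, (n-k-1)/2) distribution, with a random sign. The class name `VirtualSphereDummy` and its methods are hypothetical, and the orthonormal directions are left abstract; in an actual selector they would presumably come from the adaptively evolving subspace.

```python
import numpy as np


class VirtualSphereDummy:
    """Lazily sampled coordinates of a vector uniform on the unit sphere in R^n.

    Hypothetical illustration only: the n-dimensional vector is never stored,
    just the coordinates along the directions revealed so far.
    """

    def __init__(self, n, rng):
        self.n = n
        self.rng = rng
        self.coords = []      # coordinates revealed so far
        self.remaining = 1.0  # squared norm not yet allocated (the "stick")

    def reveal_next(self):
        """Sample the coordinate along the next orthonormal direction,
        conditional on all coordinates revealed so far."""
        k = len(self.coords)
        if k >= self.n:
            raise ValueError("all n coordinates have already been revealed")
        if k == self.n - 1:
            # The last direction takes whatever squared norm is left.
            frac = 1.0
        else:
            # Share of the remaining squared norm taken by this direction.
            frac = self.rng.beta(0.5, 0.5 * (self.n - k - 1))
        sign = self.rng.choice([-1.0, 1.0])
        value = float(sign * np.sqrt(frac * self.remaining))
        self.remaining *= 1.0 - frac  # break the stick
        self.coords.append(value)
        return value


# Usage: reveal five projections of a dummy living in R^1000.
rng = np.random.default_rng(0)
dummy = VirtualSphereDummy(n=1000, rng=rng)
projections = [dummy.reveal_next() for _ in range(5)]
print(projections, dummy.remaining)
```

Memory per dummy in this sketch grows with the number of revealed directions rather than with n, which is the kind of saving the abstract attributes to avoiding dummy matrix materialization.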
Problem

Research questions and friction points this paper is trying to address.

high-dimensional variable selection
false discovery rate
null features
scalability
genomics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtual Dummies
FDR control
Sequential Sampling
High-dimensional Selection
Memory Efficiency