🤖 AI Summary
Stability selection provides theoretical control of the expected number of false positives, E(FP), but existing upper bounds on E(FP) are loose, so the method selects few features and has low recall. To address this, we propose Integral Path Stability Selection (IPSS), which replaces the conventional maximum over the regularization path with an integral along it. This reformulation yields a rigorous upper bound on E(FP) that is tighter by several orders of magnitude, enabling significantly higher true positive rates under the same E(FP) target. IPSS preserves the computational efficiency and parameter simplicity of standard stability selection: it requires a single user-specified parameter, either the target E(FP) or the target false discovery rate (FDR), and no additional hyperparameter tuning. Evaluated on real-world prostate and colon cancer datasets, alongside multiple simulation studies, IPSS achieves average improvements of 37–62% in true positive rate at fixed E(FP) targets, with computational cost identical to the baseline algorithm.
📝 Abstract
Stability selection is a popular method for improving feature selection algorithms. One of its key attributes is that it provides theoretical upper bounds on the expected number of false positives, E(FP), enabling control of false positives in practice. However, stability selection often selects very few features, resulting in low sensitivity. This is because existing bounds on E(FP) are relatively loose, causing stability selection to overestimate the number of false positives. In this paper, we introduce a novel approach to stability selection based on integrating stability paths rather than maximizing over them. This yields upper bounds on E(FP) that are orders of magnitude stronger than previous bounds, leading to significantly more true positives in practice for the same target E(FP). Furthermore, our method takes the same amount of computation as the original stability selection algorithm, and only requires one user-specified parameter, which can be either the target E(FP) or target false discovery rate. We demonstrate the method on simulations and real data from prostate and colon cancer studies.
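The contrast between maximizing over a stability path and integrating it can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy paths, and the plain normalized trapezoidal integral are assumptions made for illustration, and the paper's actual scoring function and selection threshold are not reproduced here.

```python
# Illustrative only: compares two ways of aggregating a feature's stability
# path, i.e. its estimated selection frequency at each regularization value.

def max_score(path):
    """Conventional aggregation: highest selection frequency along the path."""
    return max(path)


def integral_score(lambdas, path):
    """Hypothetical integral aggregation: trapezoidal integral of the path,
    normalized by the range of the regularization grid."""
    area = 0.0
    for i in range(1, len(lambdas)):
        width = abs(lambdas[i] - lambdas[i - 1])
        area += 0.5 * (path[i] + path[i - 1]) * width
    return area / abs(lambdas[-1] - lambdas[0])


# Toy regularization grid, from strong to weak penalty (assumed values).
lambdas = [1.0, 0.75, 0.5, 0.25, 0.0]

# A feature selected often at a single penalty value only.
spiky = [0.0, 0.0, 0.9, 0.0, 0.0]
# A feature selected consistently over most of the grid.
steady = [0.0, 0.6, 0.8, 0.8, 0.8]

for name, path in [("spiky", spiky), ("steady", steady)]:
    print(name, max_score(path), round(integral_score(lambdas, path), 3))
```

In this toy example the maximum ranks the spiky feature above the steady one, while the integral ranks them the other way around, because it accounts for the whole path rather than a single point on it.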