AI Summary
Existing feature selection methods often rely on restrictive modeling assumptions (e.g., linearity), lack finite-sample false discovery rate (FDR) guarantees, or suffer from low statistical power in high-dimensional nonlinear settings. To address these limitations, we propose a general feature selection method that applies *integrated path stability selection* (IPSS) to arbitrary feature importance scores, such as those from gradient boosting (IPSSGB) or random forests (IPSSRF). The method is nonparametric whenever the importance scores are nonparametric, and it provides finite-sample false discovery control. It also estimates q-values, which are better suited to high-dimensional data than p-values. In nonlinear simulations with RNA-seq data, both variants accurately control the FDR while detecting more true positives than existing methods; both run in under 20 seconds on 500 samples × 5,000 features; and in screening for cancer-related microRNAs and genes, they yield better predictions with fewer selected features than existing approaches.
Abstract
Feature selection is a critical task in machine learning and statistics. However, existing feature selection methods either (i) rely on parametric methods such as linear or generalized linear models, (ii) lack theoretical false discovery control, or (iii) identify few true positives. Here, we introduce a general feature selection method with finite-sample false discovery control based on applying integrated path stability selection (IPSS) to arbitrary feature importance scores. The method is nonparametric whenever the importance scores are nonparametric, and it estimates q-values, which are better suited to high-dimensional data than p-values. We focus on two special cases using importance scores from gradient boosting (IPSSGB) and random forests (IPSSRF). Extensive nonlinear simulations with RNA sequencing data show that both methods accurately control the false discovery rate and detect more true positives than existing methods. Both methods are also efficient, running in under 20 seconds when there are 500 samples and 5000 features. We apply IPSSGB and IPSSRF to detect microRNAs and genes related to cancer, finding that they yield better predictions with fewer features than existing approaches.
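To make the idea of applying stability selection to arbitrary importance scores more concrete, the following is a minimal sketch of the subsampling building block only. It is not the IPSS algorithm: it does not integrate over a path of thresholds, and it provides no FDR bound or q-values. All names (`abs_correlation_scores`, `selection_frequencies`, `top_k`, `n_subsamples`) are hypothetical, and the absolute-correlation scorer is a simple stand-in for importance scores that would, in the setting described above, come from gradient boosting or random forests.

```python
import random
from typing import Callable, List

# A scorer maps (X, y) to one non-negative importance score per feature.
Scorer = Callable[[List[List[float]], List[float]], List[float]]


def abs_correlation_scores(X: List[List[float]], y: List[float]) -> List[float]:
    """Placeholder importance score: |Pearson correlation| of each feature with y."""
    n, p = len(y), len(X[0])
    y_mean = sum(y) / n
    y_dev = [v - y_mean for v in y]
    y_ss = sum(d * d for d in y_dev)
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        col_mean = sum(col) / n
        dev = [v - col_mean for v in col]
        ss = sum(d * d for d in dev)
        if ss == 0 or y_ss == 0:
            scores.append(0.0)  # constant column or constant response
        else:
            cov = sum(a * b for a, b in zip(dev, y_dev))
            scores.append(abs(cov) / (ss * y_ss) ** 0.5)
    return scores


def selection_frequencies(
    X: List[List[float]],
    y: List[float],
    scorer: Scorer,
    top_k: int,
    n_subsamples: int = 50,
    seed: int = 0,
) -> List[float]:
    """Fraction of random half-samples in which each feature ranks in the top_k."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # half of the rows, without replacement
        scores = scorer([X[i] for i in idx], [y[i] for i in idx])
        top = sorted(range(p), key=lambda j: scores[j], reverse=True)[:top_k]
        for j in top:
            counts[j] += 1
    return [c / n_subsamples for c in counts]


# Toy data (assumed for illustration): only feature 0 drives the response, nonlinearly.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(10)] for _ in range(200)]
y = [row[0] ** 3 + 0.1 * data_rng.gauss(0, 1) for row in X]
freqs = selection_frequencies(X, y, abs_correlation_scores, top_k=2)
```

Features whose selection frequency stays high across subsamples are the stable candidates. Because the scorer is passed in as a function, any importance score with the same signature could be substituted without changing the subsampling loop.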