🤖 AI Summary
Problem: Prognostic biomarker discovery in high-dimensional multi-omics pancreatic cancer data suffers from the curse of dimensionality, poor stability across datasets, and dependence on arbitrary thresholds. Method: We propose a hybrid ensemble feature selection framework that integrates embedded (CoxLasso) and wrapper (survival SVM, random survival forest) approaches. A resampling-driven voting scheme across multiple models and subsamples yields a robust feature ranking, while Pareto frontier analysis automatically determines the optimal feature set size, eliminating manual thresholding. Implemented efficiently in mlr3fselect, the method was validated across three independent pancreatic cancer cohorts. Results: It reduced biomarker counts by 62% on average, significantly improved stability (Jaccard similarity +0.31), and maintained predictive performance comparable to CoxLasso alone (ΔC-index < 0.02), thus balancing clinical interpretability with prediction reliability.
📝 Abstract
Prediction of patient survival using high-dimensional multi-omics data requires systematic feature selection methods that ensure predictive performance, sparsity, and reliability for prognostic biomarker discovery. We developed a hybrid ensemble feature selection (hEFS) approach that combines data subsampling with multiple prognostic models, integrating both embedded and wrapper-based strategies for survival prediction. Omics features are ranked using a voting-theory-inspired aggregation mechanism across models and subsamples, while the optimal number of features is selected via a Pareto front, balancing predictive accuracy and model sparsity without any user-defined thresholds. When applied to multi-omics datasets from three pancreatic cancer cohorts, hEFS identifies significantly fewer and more stable biomarkers than conventional late-fusion CoxLasso models, while maintaining comparable discrimination performance. Implemented within the open-source mlr3fselect R package, hEFS offers a robust, interpretable, and clinically valuable tool for prognostic modelling and biomarker discovery in high-dimensional survival settings.
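The two steps described above, vote-based ranking across runs and threshold-free choice of the feature count, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the mlr3fselect implementation: it assumes plain approval voting (counting how often each feature is selected) as the aggregation rule and a simple knee-point heuristic on the Pareto front, neither of which the abstract specifies. The function names, feature names, and scores are hypothetical toy values.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[int, float]  # (number of features, performance score, e.g. C-index)


def approval_ranking(selections: Sequence[Sequence[str]]) -> List[Tuple[str, float]]:
    """Rank features by the fraction of (model, subsample) runs that selected them.

    Assumption: approval voting, i.e. each run casts one vote per selected feature.
    """
    counts = Counter(f for sel in selections for f in set(sel))
    n_runs = len(selections)
    # Sort by descending selection frequency; break ties by name for determinism.
    return sorted(((f, c / n_runs) for f, c in counts.items()),
                  key=lambda fc: (-fc[1], fc[0]))


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Keep points that no sparser-or-equal point matches or beats on score."""
    front: List[Point] = []
    best = float("-inf")
    for n, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best:  # strictly better than every sparser candidate
            front.append((n, score))
            best = score
    return front


def knee_point(front: Sequence[Point]) -> Point:
    """Pick the front point farthest above the line joining its two endpoints.

    Assumption: this knee heuristic stands in for whatever rule the authors use.
    """
    if len(front) < 3:
        return front[0]  # too few points for a knee; fall back to the sparsest
    (n0, s0), (n1, s1) = front[0], front[-1]

    def gain(p: Point) -> float:
        x = (p[0] - n0) / (n1 - n0)   # normalised feature count in [0, 1]
        y = (p[1] - s0) / (s1 - s0)   # normalised score in [0, 1]
        return y - x                  # height above the diagonal

    return max(front, key=gain)


# Toy input: features selected in four hypothetical (model, subsample) runs.
runs = [["g1", "g2", "g3"], ["g1", "g2"], ["g1", "g4"], ["g1", "g2", "g5"]]
ranking = approval_ranking(runs)

# Toy scores obtained when keeping the top-k ranked features (made-up values).
candidates = [(1, 0.60), (2, 0.68), (3, 0.69), (4, 0.69), (5, 0.70)]
n_selected, _ = knee_point(pareto_front(candidates))
selected = [name for name, _ in ranking[:n_selected]]
print(selected)
```

In this toy input the knee falls at two features: adding a third raises the score by only 0.01, so the sketch keeps `g1` and `g2` without any user-supplied cutoff.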