Univariate-Guided Sparse Regression for Biobank-Scale High-Dimensional -omics Data

📅 2025-11-26
🤖 AI Summary
Polygenic risk score (PRS) estimation from high-dimensional genomic data suffers from unstable feature selection, poor model interpretability, and limited scalability to biobanks such as the UK Biobank, where over one million variants are measured on hundreds of thousands of individuals. To address these challenges, this paper adapts uniLasso, a two-stage penalized regression procedure in which univariate coefficients guide a Lasso-based sparse fit, improving feature selection stability and model sparsity. The framework is further extended to integrate external summary statistics (e.g., GWAS summary statistics) to enhance prediction accuracy. The adapted method achieves prediction performance comparable to standard Lasso while selecting over 40% fewer genetic variants, which improves PRS interpretability and biological traceability, and it outperforms competing PRS methods such as PRS-CS. The method also scales computationally to ultra-large biobanks.

📝 Abstract
We present a scalable framework for computing polygenic risk scores (PRS) in high-dimensional genomic settings using the recently introduced Univariate-Guided Sparse Regression (uniLasso). UniLasso is a two-stage penalized regression procedure that leverages univariate coefficients and magnitudes to stabilize feature selection and enhance interpretability. Building on its theoretical and empirical advantages, we adapt uniLasso for application to the UK Biobank, a population-based repository comprising over one million genetic variants measured on hundreds of thousands of individuals from the United Kingdom. We further extend the framework to incorporate external summary statistics to increase predictive accuracy. Our results demonstrate that the adapted uniLasso attains predictive performance comparable to standard Lasso while selecting substantially fewer variants, yielding sparser and more interpretable models. Moreover, it exhibits superior performance in estimating PRS relative to its competitors, such as PRS-CS. Integrating external scores further improves prediction while maintaining sparsity.
Problem

Research questions and friction points this paper is trying to address.

Develop a scalable polygenic risk score framework for biobank-scale genomic data
Enhance feature selection stability and model interpretability in high-dimensional omics data
Integrate external summary statistics to improve prediction accuracy while maintaining sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage penalized regression for polygenic risk scores
Leverages univariate coefficients to stabilize feature selection
Incorporates external summary statistics to enhance prediction
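
The two-stage idea listed above can be sketched in a few lines. This is a minimal illustration based on the published description of uniLasso as we understand it (per-feature least-squares fits with leave-one-out predictions, followed by a non-negative Lasso on those predictions); it is not the authors' biobank-scale implementation. The function names, the plain coordinate-descent solver, the penalty value, and the synthetic data are all assumptions made for illustration, and the external-summary-statistics extension is not shown.

```python
import numpy as np


def loo_univariate_fits(X, y):
    """Stage 1 (sketch): per-feature least squares with leave-one-out predictions."""
    n, _ = X.shape
    xbar, ybar = X.mean(axis=0), y.mean()
    Xc, yc = X - xbar, y - ybar
    sxx = (Xc ** 2).sum(axis=0)               # assumes no constant columns
    slope = Xc.T @ yc / sxx                   # univariate slopes
    intercept = ybar - slope * xbar
    fitted = intercept + X * slope            # (n, p) in-sample fits
    leverage = 1.0 / n + Xc ** 2 / sxx        # hat-matrix diagonal per feature
    # Least-squares identity: LOO residual = residual / (1 - leverage)
    loo = y[:, None] - (y[:, None] - fitted) / (1.0 - leverage)
    return slope, intercept, loo


def nonneg_lasso(F, y, lam, n_iter=200):
    """Stage 2 (sketch): Lasso with theta >= 0, solved by cyclic coordinate descent.

    Minimizes (1 / 2n) * ||y - theta0 - F @ theta||^2 + lam * sum(theta).
    """
    n, p = F.shape
    fbar, ybar = F.mean(axis=0), y.mean()
    Fc, yc = F - fbar, y - ybar
    col_sq = (Fc ** 2).sum(axis=0) / n
    theta = np.zeros(p)
    resid = yc.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = Fc[:, j] @ resid / n + col_sq[j] * theta[j]
            new = max(rho - lam, 0.0) / col_sq[j]   # one-sided soft threshold
            resid += Fc[:, j] * (theta[j] - new)
            theta[j] = new
    theta0 = ybar - fbar @ theta
    return theta0, theta


def unilasso_sketch(X, y, lam):
    """Combine both stages into coefficients on the original features."""
    slope, intercept, loo = loo_univariate_fits(X, y)
    theta0, theta = nonneg_lasso(loo, y, lam)
    beta = theta * slope                      # sign matches the univariate slope
    beta0 = theta0 + theta @ intercept
    return beta0, beta


# Synthetic stand-in for a genotype matrix (hypothetical data, not UK Biobank).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

beta0, beta = unilasso_sketch(X, y, lam=0.1)
prs = beta0 + X @ beta                        # score = weighted sum of features
print("selected features:", np.flatnonzero(beta))
```

Because the second stage constrains its weights to be non-negative, each final coefficient is either zero or carries the sign of its univariate slope, which is the property that makes the selected variants easier to interpret. At biobank scale the dense in-memory arrays used here would need to be replaced by batched or out-of-core computation.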
Authors
Joshua Richland
Department of Statistics, Stanford University
Tuomo Kiiskinen
Department of Biomedical Data Science, Stanford University
William Wang
Unknown affiliation
Non-volatile Memories, Computer Architecture, Microarchitecture
Sophia Lu
Department of Statistics, Stanford University
Balasubramanian Narasimhan
Senior Research Scientist, Department of Biomedical Data Sciences and Department of Statistics
Statistical Computing, machine learning, clinical trials, bioinformatics
Manuel Rivas
Department of Biomedical Data Science, Stanford University
Robert Tibshirani
Professor of Biomedical Data Sciences, and of Statistics, Stanford University
Statistics, data science, Machine Learning