🤖 AI Summary
Traditional sparse PCA suffers from numerical instability, poor recovery of sparse feature supports, and difficulty in jointly optimizing dimensionality reduction with downstream tasks—such as polygenic risk score (PRS) prediction and clustering—on high-dimensional, sparse genomic data. To address these challenges, this paper proposes a differentiable sparse PCA framework based on the smooth L1 penalty. By replacing the non-differentiable LASSO constraint with an analytically differentiable smooth L1 regularizer, and integrating SVD-based orthogonality constraints with L-BFGS optimization, our method enables stable, joint estimation of higher-order principal components for the first time. Experiments on the 1000 Genomes dataset demonstrate substantial improvements: enhanced numerical stability and sparse support recovery, average PRS prediction accuracy gains of 4.2%, and a 12.7% increase in clustering silhouette score. The proposed approach consistently outperforms seven state-of-the-art sparse PCA methods across all evaluated metrics.
📝 Abstract
Principal components computed via PCA (principal component analysis) are traditionally used to reduce dimensionality in genomic data or to correct for population stratification. In this paper, we explore the penalized eigenvalue problem (PEP) which reformulates the computation of the first eigenvector as an optimization problem and adds an $L_1$ penalty constraint to enforce sparseness of the solution. The contribution of our article is threefold. First, we extend PEP by applying smoothing to the original LASSO-type $L_1$ penalty. This allows one to compute analytical gradients which enable faster and more efficient minimization of the objective function associated with the optimization problem. Second, we demonstrate how higher order eigenvectors can be calculated with PEP using established results from singular value decomposition (SVD). Third, we present four experimental studies to demonstrate the usefulness of the smoothed penalized eigenvectors. Using data from the 1000 Genomes Project dataset, we empirically demonstrate that our proposed smoothed PEP allows one to increase numerical stability and obtain meaningful eigenvectors. We also employ the penalized eigenvector approach in two additional real data applications (computation of a polygenic risk score and clustering), demonstrating that exchanging the penalized eigenvectors for their smoothed counterparts can increase prediction accuracy in polygenic risk scores and enhance discernibility of clusterings. Moreover, we compare our proposed smoothed PEP to seven state-of-the-art algorithms for sparse PCA and evaluate the accuracy of the obtained eigenvectors, their support recovery, and their runtime.