🤖 AI Summary
Single-cell RNA sequencing (scRNA-seq) data are noisy: variability arises from biological differences, PCR amplification bias, limited sequencing depth, and low capture efficiency, which makes it hard to adapt computational pipelines across datasets and technologies. Most studies therefore still rely on PCA for dimensionality reduction. To improve on PCA, we propose SPCA-RMT, a nearly parameter-free framework that combines a biwhitening preprocessing step (inspired by the Sinkhorn–Knopp algorithm), a random matrix theory (RMT)-based criterion, and existing sparse PCA algorithms. Its key contributions are joint variance stabilization across genes and cells, and automatic, RMT-guided selection of the sparsity level, which removes manual tuning of that hyperparameter. Across seven scRNA-seq technologies and four sparse PCA algorithms, SPCA-RMT systematically improves reconstruction of the principal subspace and consistently outperforms PCA-, autoencoder-, and diffusion-based methods in cell-type classification.
📝 Abstract
Single-cell RNA-seq provides detailed molecular snapshots of individual cells but is notoriously noisy. Variability stems from biological differences, PCR amplification bias, limited sequencing depth, and low capture efficiency, making it challenging to adapt computational pipelines to heterogeneous datasets or evolving technologies. As a result, most studies still rely on principal component analysis (PCA) for dimensionality reduction, valued for its interpretability and robustness. Here, we improve upon PCA with a Random Matrix Theory (RMT)-based approach that guides the inference of sparse principal components using existing sparse PCA algorithms. We first introduce a novel biwhitening method, inspired by the Sinkhorn-Knopp algorithm, that simultaneously stabilizes variance across genes and cells. This enables the use of an RMT-based criterion to automatically select the sparsity level, rendering sparse PCA nearly parameter-free. Our mathematically grounded approach retains the interpretability of PCA while enabling robust, hands-off inference of sparse principal components. Across seven single-cell RNA-seq technologies and four sparse PCA algorithms, we show that this method systematically improves the reconstruction of the principal subspace and consistently outperforms PCA-, autoencoder-, and diffusion-based methods in cell-type classification tasks.
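To make the biwhitening idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes Poisson-like counts, so that the count matrix itself can stand in for the entrywise noise variance, and it uses Sinkhorn-Knopp iterations to find row and column scalings under which the average noise variance is 1 in every row and every column. Under that assumption, eigenvalues of the rescaled matrix that exceed the Marchenko-Pastur upper edge are treated as signal. The function names (`sinkhorn_knopp`, `biwhiten`, `count_signal_components`) and the simulated data are hypothetical, and the sketch stops at counting signal components; it does not reproduce the paper's criterion for selecting the sparsity level.

```python
import numpy as np


def sinkhorn_knopp(V, n_iter=1000, tol=1e-9):
    """Find u, v > 0 such that diag(u) @ V @ diag(v) has row sums n and column sums m.

    V is a nonnegative (m, n) matrix without all-zero rows or columns
    (such rows and columns should be filtered out beforehand).
    """
    m, n = V.shape
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iter):
        u = n / (V @ v)        # enforce row sums equal to n
        v = m / (V.T @ u)      # enforce column sums equal to m
        # Column sums are now exact; stop once row sums are close enough too.
        if np.abs(u * (V @ v) - n).max() < tol * n:
            break
    return u, v


def biwhiten(Y):
    """Rescale rows and columns of a count matrix (assumption: Poisson noise)."""
    Y = np.asarray(Y, dtype=float)
    # For Poisson counts the variance equals the mean, so Y estimates the variances.
    u, v = sinkhorn_knopp(Y)
    Y_hat = np.sqrt(u)[:, None] * Y * np.sqrt(v)[None, :]
    return Y_hat, u, v


def count_signal_components(Y_hat):
    """Count eigenvalues of Y_hat @ Y_hat.T / n above the Marchenko-Pastur edge."""
    m, n = Y_hat.shape
    eigvals = np.linalg.svd(Y_hat, compute_uv=False) ** 2 / n
    mp_edge = (1.0 + np.sqrt(m / n)) ** 2  # upper edge for unit-variance noise
    return int(np.sum(eigvals > mp_edge)), mp_edge


# Simulated example: Poisson counts whose rate matrix has rank 3.
rng = np.random.default_rng(0)
m, n, true_rank = 200, 400, 3
rates = rng.gamma(2.0, 1.0, (m, true_rank)) @ rng.gamma(2.0, 1.0, (true_rank, n))
counts = rng.poisson(rates)

Y_hat, u, v = biwhiten(counts)
n_signal, mp_edge = count_signal_components(Y_hat)
print(f"MP upper edge: {mp_edge:.3f}, eigenvalues above it: {n_signal}")
```

In a pipeline of this kind, the number of eigenvalues above the edge would give the dimension of the principal subspace to retain, and a sparse PCA solver (for example `sklearn.decomposition.SparsePCA`) could then be run on the biwhitened matrix. How the sparsity penalty is tuned against the RMT prediction is specific to the paper and is not shown here.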