🤖 AI Summary
This work addresses a systematic statistical bias that random sampling introduces in high-dimensional least squares and CUR decomposition. The bias stems from the nonlinear pseudoinverse operation, an effect that classical subspace embedding theory does not account for. The paper establishes a unified non-asymptotic analytical framework for high-dimensional randomized oblique projections, enabling the first systematic identification and quantification of this bias and revealing the suboptimality of prevailing sampling schemes. Leveraging a bias-variance decomposition, the authors propose a general debiasing estimator that improves the approximation accuracy of subsampled ordinary least squares and CUR decomposition. Both theoretical analysis and empirical experiments indicate that the proposed debiased CUR algorithm achieves better statistical efficiency while remaining computationally feasible in high-dimensional settings.
📝 Abstract
Random sampling is a fundamental tool in modern machine learning and numerical linear algebra for reducing the computational cost of large-scale matrix problems. Existing analyses, however, rely primarily on subspace embedding guarantees, which do not precisely characterize the statistical bias of the nonlinear random oblique projections induced by sampling. Such projections arise ubiquitously in subsampled least squares and fast low-rank approximation methods. Because (pseudo)inversion is nonlinear, these random oblique projections can be systematically biased even when the underlying sketch is unbiased, thereby introducing hidden bias into downstream least squares and low-rank approximation solutions.
In this work, we develop a unified non-asymptotic theory for random oblique projections in high dimensions. We show that standard random sampling schemes generally induce a systematic statistical bias overlooked by classical subspace embedding-style analyses, and we propose a principled debiasing framework to correct it. We illustrate the power of the theory through two canonical applications. For subsampled least squares, we obtain sharp bias-variance characterizations, reveal previously unrecognized statistical suboptimality in widely used sampling schemes, and identify when debiasing yields provable improvements. For fast CUR decomposition, we develop a debiased approach with improved approximation accuracy. Numerical experiments further validate our theoretical findings.
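The abstract's central observation, that an unbiased sketch can still produce a biased inverse, can be checked numerically. The following is a minimal sketch and not the paper's method or its debiasing estimator: it assumes uniform row sampling with replacement, a Gaussian test matrix, and hypothetical names (`sketch_bias_demo`, `n`, `d`, `m`, `trials`) chosen for illustration. It averages the sampled Gram matrix and its pseudoinverse over many trials and compares each against the full-data quantity.

```python
import numpy as np

def sketch_bias_demo(n=1000, d=5, m=20, trials=2000, seed=0):
    """Compare the Monte Carlo mean of a sampled Gram matrix and of its
    inverse against their full-data counterparts (illustrative only)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))        # assumed Gaussian test matrix
    gram = A.T @ A                         # full-data Gram matrix

    mean_gram = np.zeros((d, d))
    mean_inv = np.zeros((d, d))
    for _ in range(trials):
        # Uniform sampling with replacement; the sqrt(n / m) rescaling
        # makes the sampled Gram matrix unbiased for A^T A.
        idx = rng.integers(0, n, size=m)
        SA = A[idx] * np.sqrt(n / m)
        G = SA.T @ SA
        mean_gram += G / trials
        mean_inv += np.linalg.pinv(G) / trials

    # Relative error of the averaged sketch: should be small (unbiased).
    gram_err = np.linalg.norm(mean_gram - gram) / np.linalg.norm(gram)
    # Trace ratio of the averaged inverse to the true inverse:
    # a value noticeably above 1 indicates inversion bias.
    inv_ratio = np.trace(mean_inv) / np.trace(np.linalg.inv(gram))
    return gram_err, inv_ratio

gram_err, inv_ratio = sketch_bias_demo()
print(f"relative error of averaged Gram matrix: {gram_err:.3f}")
print(f"trace ratio of averaged inverse to true inverse: {inv_ratio:.3f}")
```

Under these assumptions the averaged Gram matrix should sit close to the full one, while the averaged inverse should be inflated. That is consistent with the operator convexity of matrix inversion, and the gap would be expected to shrink as the sample size `m` grows relative to `d`.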