🤖 AI Summary
The excess risk of PCA is governed by the geometric structure of the data distribution, specifically the decay of the eigenvalue spectrum and the curvature of the underlying subspace geometry. To characterize this, we establish a central limit theorem for the principal subspace estimation error on the Grassmann manifold. Crucially, we show that the negative block Rayleigh quotient is generalized self-concordant along geodesics emanating from its minimizer, a geometric property that enables non-asymptotic risk analysis. Combining tools from random matrix theory, differential geometry, and asymptotic statistical inference, we derive a tight non-asymptotic upper bound on the excess risk that recovers the exact asymptotic behavior and precisely characterizes the limiting distribution of the reconstruction error. Our core contributions are threefold: (i) uncovering the intrinsic geometric nature of PCA risk; (ii) constructing the first asymptotic error theory for PCA with an explicit geometric interpretation; and (iii) moving beyond classical spectral perturbation analysis through a principled manifold-based framework.
📝 Abstract
What property of the data distribution determines the excess risk of principal component analysis? In this paper, we provide a precise answer to this question. We establish a central limit theorem for the error of the principal subspace estimated by PCA, and derive the asymptotic distribution of its excess risk under the reconstruction loss. We obtain a non-asymptotic upper bound on the excess risk of PCA that recovers, in the large sample limit, our asymptotic characterization. Underlying our contributions is the following result: we prove that the negative block Rayleigh quotient, defined on the Grassmannian, is generalized self-concordant along geodesics emanating from its minimizer with maximum rotation less than $\pi/4$.
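For readers unfamiliar with the objects named in the abstract, a standard formulation of the negative block Rayleigh quotient and the reconstruction loss may be sketched as follows (the notation here is ours, not necessarily the paper's):

```latex
% Assumed setup: \Sigma \in \mathbb{R}^{d \times d} is the population
% covariance of a centered random vector x, and a k-dimensional subspace
% is represented by U \in \mathbb{R}^{d \times k} with U^\top U = I_k.

% Negative block Rayleigh quotient:
f(U) \;=\; -\operatorname{tr}\!\left(U^\top \Sigma\, U\right).

% f(UO) = f(U) for every orthogonal O \in O(k), so f depends only on the
% column span of U and descends to a function on the Grassmannian Gr(k, d).

% Reconstruction loss and excess risk (with \mathbb{E}[x x^\top] = \Sigma):
R(U) \;=\; \mathbb{E}\,\bigl\lVert x - U U^\top x \bigr\rVert^2
      \;=\; \operatorname{tr}(\Sigma) + f(U),
\qquad
\mathcal{E}(U) \;=\; R(U) - \min_{V^\top V = I_k} R(V).
```

Under this formulation, minimizing the reconstruction loss over subspaces is equivalent to minimizing $f$ on the Grassmannian, whose minimizer is the span of the top-$k$ eigenvectors of $\Sigma$.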