PCA of probability measures: Sparse and Dense sampling regimes

📅 2026-02-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the lack of theoretical characterization of the double asymptotic regime, in which both the number of measures \(n\) and the number of observations per measure \(m\) grow, for principal component analysis (PCA) of multiple probability measures. Working within the framework of reproducing kernel Hilbert space embeddings and combining functional PCA with a double asymptotic analysis, the authors derive convergence rates for both the estimated covariance operator and the PCA excess risk. They uncover, for the first time, a phase transition in convergence rates across sampling regimes, from sparse (\(m\) small) to dense (\(m\) large), and establish that the dense-regime rate is minimax optimal for the empirical covariance estimation error. The theoretical convergence rate takes the form \(n^{-1/2} + m^{-\alpha}\), which is corroborated by numerical experiments; these further demonstrate that judicious subsampling can substantially reduce computational cost while preserving statistical accuracy.

📝 Abstract
A common approach to perform PCA on probability measures is to embed them into a Hilbert space where standard functional PCA techniques apply. While convergence rates for estimating the embedding of a single measure from $m$ samples are well understood, the literature has not addressed the setting involving multiple measures. In this paper, we study PCA in a double asymptotic regime where $n$ probability measures are observed, each through $m$ samples. We derive convergence rates of the form $n^{-1/2} + m^{-\alpha}$ for the empirical covariance operator and the PCA excess risk, where $\alpha>0$ depends on the chosen embedding. This characterizes the relationship between the number $n$ of measures and the number $m$ of samples per measure, revealing a sparse (small $m$) to dense (large $m$) transition in the convergence behavior. Moreover, we prove that the dense-regime rate is minimax optimal for the empirical covariance error. Our numerical experiments validate these theoretical rates and demonstrate that appropriate subsampling preserves PCA accuracy while reducing computational cost.
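The approach described in the abstract, embedding each measure into a Hilbert space and applying functional PCA, can be sketched numerically. The snippet below is a minimal illustration (not the paper's implementation): it uses Gaussian-kernel mean embeddings of \(n\) measures, each observed through \(m\) samples, and runs kernel PCA on the resulting Gram matrix of inner products between embeddings. The kernel bandwidth, the synthetic Gaussian measures, and all variable names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian kernel between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, m, d = 20, 50, 2
# n probability measures, each observed through m samples in R^d
samples = [rng.normal(loc=rng.normal(size=d), size=(m, d)) for _ in range(n)]

# Gram matrix of empirical kernel mean embeddings:
# G[i, j] = <mu_hat_i, mu_hat_j> = average of k(x, y) over all sample pairs
G = np.array([[gaussian_kernel(Xi, Xj).mean() for Xj in samples]
              for Xi in samples])

# Functional PCA on the embedded measures: center the Gram matrix,
# then eigendecompose (kernel PCA)
H = np.eye(n) - np.ones((n, n)) / n
Gc = H @ G @ H
eigvals, eigvecs = np.linalg.eigh(Gc)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order

# Scores of each measure on the first two principal components
scores = eigvecs[:, :2] * np.sqrt(np.clip(eigvals[:2], 0, None))
print(scores.shape)
```

In this setting, increasing \(m\) reduces the error in each empirical embedding \(\hat\mu_i\) (the \(m^{-\alpha}\) term), while increasing \(n\) reduces the error of the empirical covariance operator over measures (the \(n^{-1/2}\) term), which is the trade-off the paper's sparse-to-dense analysis quantifies.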
Problem

Research questions and friction points this paper is trying to address.

PCA
probability measures
sparse sampling
dense sampling
convergence rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

PCA of probability measures
double asymptotic regime
sparse and dense sampling
minimax optimality
Hilbert space embedding