🤖 AI Summary
This paper addresses the problem of estimating the top-$k$ principal components of a covariance matrix $Sigma$ from a collection of random matrices under differential privacy. Existing methods suffer from limitations: requiring superlinear sample complexity in dimension ($n gg d$), excessive noise injection, or applicability only to $k=1$. We propose the first efficient differentially private algorithm supporting arbitrary $k leq d$. Our method builds upon an iterative optimization framework with adaptive noise injection, leveraging intrinsic data randomness to reduce privacy cost. We establish theoretical guarantees showing near-optimal statistical error with only $n = ilde{O}(d)$ samples; for $k=1$, our error matches the information-theoretic lower bound. We further provide tight upper and lower bounds characterizing the fundamental trade-off. Experiments demonstrate that our approach significantly outperforms existing baselines in the privacy–utility trade-off.
📝 Abstract
Given $n$ i.i.d. random matrices $A_i in mathbb{R}^{d imes d}$ that share a common expectation $Σ$, the objective of Differentially Private Stochastic PCA is to identify a subspace of dimension $k$ that captures the largest variance directions of $Σ$, while preserving differential privacy (DP) of each individual $A_i$. Existing methods either (i) require the sample size $n$ to scale super-linearly with dimension $d$, even under Gaussian assumptions on the $A_i$, or (ii) introduce excessive noise for DP even when the intrinsic randomness within $A_i$ is small. Liu et al. (2022a) addressed these issues for sub-Gaussian data but only for estimating the top eigenvector ($k=1$) using their algorithm DP-PCA. We propose the first algorithm capable of estimating the top $k$ eigenvectors for arbitrary $k leq d$, whilst overcoming both limitations above. For $k=1$ our algorithm matches the utility guarantees of DP-PCA, achieving near-optimal statistical error even when $n = ilde{!O}(d)$. We further provide a lower bound for general $k > 1$, matching our upper bound up to a factor of $k$, and experimentally demonstrate the advantages of our algorithm over comparable baselines.