🤖 AI Summary
This paper addresses the utility degradation caused by excessive noise injection in differential privacy, focusing on the privacy–utility trade-off of private principal component analysis (PCA) in the high-dimensional limit. We give a sharp high-dimensional privacy characterization of the exponential mechanism, recasting the detection of an individual's presence in the dataset as hypothesis testing between two Gaussian distributions with slightly different means. Our approach combines the Dong–Roth–Su hypothesis-testing formulation of privacy with Le Cam's theory of contiguity. In the asymptotic regime where the dimension $p \to \infty$, we derive, for the first time, the exact critical threshold for the noise level, improving on conventional conservative bounds and showing that no more than the minimally necessary noise need be injected. This result substantially improves PCA utility and establishes a new paradigm for tight privacy analysis in high-dimensional statistical learning.
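The Gaussian hypothesis-testing view of privacy admits a closed form: in the $f$-differential-privacy framework of Dong, Roth, and Su (2022), the optimal trade-off between type I and type II errors when distinguishing $\mathcal{N}(0,1)$ from $\mathcal{N}(\mu,1)$ is $f_\mu(\alpha) = \Phi(\Phi^{-1}(1-\alpha) - \mu)$. Below is a minimal sketch of this curve (the values of $\mu$ are placeholders; in the paper, the relevant $\mu$ is determined by spectral properties of the dataset):

```python
import numpy as np
from scipy.stats import norm

def gaussian_tradeoff(alpha, mu):
    """Type II error of the optimal test between N(0,1) and N(mu,1)
    at type I error level alpha: the mu-GDP trade-off curve of
    Dong, Roth, and Su (2022)."""
    return norm.cdf(norm.ppf(1 - alpha) - mu)

# The smaller mu is, the closer the curve sits to the
# perfect-privacy line f(alpha) = 1 - alpha.
alphas = np.linspace(0.01, 0.99, 99)
for mu in [0.5, 1.0, 2.0]:
    betas = gaussian_tradeoff(alphas, mu)
    print(f"mu={mu}: min total error alpha+beta = {np.min(alphas + betas):.3f}")
```

A sharp privacy characterization then amounts to pinning down the exact limiting $\mu$ for the mechanism, rather than a conservative upper bound on it.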
📝 Abstract
In differential privacy, statistics of a sensitive dataset are privatized by introducing random noise. Most privacy analyses provide privacy bounds specifying a noise level sufficient to achieve a target privacy guarantee. Sometimes, these bounds are pessimistic and suggest adding excessive noise, which overwhelms the meaningful signal. It remains unclear whether such high noise levels are truly necessary or merely an artifact of the proof techniques. This paper explores whether we can obtain sharp privacy characterizations that identify the smallest noise level required to achieve a target privacy level for a given mechanism. We study this problem in the context of differentially private principal component analysis, where the goal is to privatize the leading principal components (PCs) of a dataset with $n$ samples and $p$ features. We analyze the exponential mechanism for this problem in a model-free setting and provide sharp utility and privacy characterizations in the high-dimensional limit ($p \rightarrow \infty$). Our privacy result shows that, in high dimensions, detecting the presence of a target individual in the dataset using the privatized PCs is exactly as hard as distinguishing two Gaussians with slightly different means, where the mean difference depends on certain spectral properties of the dataset. Our privacy analysis combines the hypothesis-testing formulation of privacy guarantees proposed by Dong, Roth, and Su (2022) with classical contiguity arguments due to Le Cam to obtain sharp high-dimensional privacy characterizations.
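For concreteness, the exponential mechanism for the leading PC samples a unit vector $v$ with density proportional to $\exp(\beta\, v^\top \hat{\Sigma} v)$, where $\hat{\Sigma}$ is the sample covariance and $\beta$ controls the noise level. The sketch below is an illustrative implementation, not the paper's code: it uses a simple Metropolis random walk on the unit sphere, and the values of $\beta$, the proposal scale, and the planted-spike test data are arbitrary assumptions rather than a calibrated privacy guarantee.

```python
import numpy as np

def exp_mech_top_pc(S, beta, n_steps=20000, step=0.1, rng=None):
    """Approximately sample a unit vector v with density proportional
    to exp(beta * v^T S v) (a Bingham-type distribution) via a
    Metropolis random walk on the unit sphere.

    beta trades off privacy (small beta, flat density) against utility
    (large beta, concentrated near the top eigenvector); calibrating
    beta to a target privacy level is exactly what a sharp privacy
    characterization pins down."""
    rng = np.random.default_rng() if rng is None else rng
    p = S.shape[0]
    v = rng.standard_normal(p)
    v /= np.linalg.norm(v)
    score = v @ S @ v
    for _ in range(n_steps):
        # Symmetric-in-spirit proposal: perturb and project back to the sphere.
        w = v + step * rng.standard_normal(p)
        w /= np.linalg.norm(w)
        new_score = w @ S @ w
        if np.log(rng.uniform()) < beta * (new_score - score):
            v, score = w, new_score
    return v

# Example on a synthetic sample covariance with one planted spike.
rng = np.random.default_rng(0)
p, n = 50, 200
u = np.zeros(p)
u[0] = 1.0
X = rng.standard_normal((n, p)) + 3.0 * rng.standard_normal((n, 1)) * u
S = X.T @ X / n
v = exp_mech_top_pc(S, beta=50.0, rng=rng)
print("alignment of private PC with planted spike:", abs(v @ u))
```

Larger $\beta$ drives the alignment toward 1 at the cost of weaker privacy; the paper's contribution is the exact high-dimensional characterization of this trade-off.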