🤖 AI Summary
Statistical inference for categorical data under differential privacy (DP) faces a fundamental challenge: the coupling of privacy noise with sampling randomness obscures the sampling distribution of estimators, undermining confidence interval coverage and hypothesis test validity. To address this, we propose a simulation-matching method grounded in fiducial inference, which explicitly decouples the sampling mechanism from additive DP noise in the data-generating process. By reconstructing the noisy data generation with this separation, our approach yields a high-fidelity approximation of the estimator’s sampling distribution. This enables construction of confidence intervals with guaranteed coverage and hypothesis tests with robust statistical power. Extensive experiments across multiple synthetic and real-world categorical datasets demonstrate that our method significantly outperforms existing DP inference approaches in coverage accuracy, computational efficiency, and statistical power—establishing a new paradigm for verifiable statistical inference under differential privacy.
📝 Abstract
The task of statistical inference, which includes the building of confidence intervals and tests for parameters and effects of interest to a researcher, is still an open area of investigation in a differentially private (DP) setting. Indeed, in addition to the randomness due to data sampling, DP delivers another source of randomness consisting of the noise added to protect an individual's data from being disclosed to a potential attacker. As a result of this convolution of noises, in many cases it is too complicated to determine the stochastic behavior of the statistics and parameters resulting from a DP procedure. In this work, we contribute to this line of investigation by employing a simulation-based matching approach, solved through tools from the fiducial framework, which aims to replicate the data generation pipeline (including the DP step) and retrieve an approximate distribution of the estimates resulting from this pipeline. For this purpose, we focus on the analysis of categorical (nominal) data that is common in national surveys, for which sensitivity is naturally defined, and on additive privacy mechanisms. We prove the validity of the proposed approach in terms of coverage and highlight its good computational and statistical performance for different inferential tasks in simulated and applied data settings.