🤖 AI Summary
This paper addresses the efficient approximation of the statistical similarity measure $S_{\text{stat}}(P,Q) = \sum_x \min(P(x),Q(x))$ between probability distributions. For product distributions, it gives a fully polynomial-time deterministic approximation scheme (FPTAS), which returns a deterministic $\varepsilon$-approximation in time $\text{poly}(n,1/\varepsilon)$. Methodologically, it introduces a new Knapsack variant, the “Masked Knapsack problem,” and designs an FPTAS for estimating the number of solutions of a multidimensional version of it. As a complementary result, it shows that estimating $S_{\text{stat}}$ is NP-hard when $P$ and $Q$ are specified by Bayesian networks with in-degree at most two. Together, these results separate a distribution class for which $S_{\text{stat}}$ is efficiently approximable (product distributions) from one for which approximation is NP-hard (Bayes nets of in-degree two), drawing on techniques from combinatorial optimization and dynamic programming.
📝 Abstract
We study the problem of computing statistical similarity between probability distributions. For distributions $P$ and $Q$ over a finite sample space, their statistical similarity is defined as $S_{\mathrm{stat}}(P, Q) := \sum_{x} \min(P(x), Q(x))$. Statistical similarity is a basic measure of similarity between distributions, with several natural interpretations, and captures the Bayes error in prediction and hypothesis testing problems. Recent work has established that, somewhat surprisingly, even for the simple class of product distributions, exactly computing statistical similarity is $\#\mathsf{P}$-hard. This motivates the question of designing approximation algorithms for statistical similarity. Our primary contribution is a Fully Polynomial-Time deterministic Approximation Scheme (FPTAS) for estimating statistical similarity between two product distributions. To obtain this result, we introduce a new variant of the Knapsack problem, which we call the Masked Knapsack problem, and design an FPTAS to estimate the number of solutions of a multidimensional version of this problem. This new technical contribution could be of independent interest. Furthermore, we also establish a complementary hardness result. We show that it is $\mathsf{NP}$-hard to estimate statistical similarity when $P$ and $Q$ are Bayes net distributions of in-degree $2$.
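To make the definition concrete, the following minimal sketch evaluates $S_{\mathrm{stat}}$ by brute force for two product distributions over $\{0,1\}^n$. It is not the paper's FPTAS: it enumerates all $2^n$ outcomes and is only meant to illustrate the quantity being approximated. The function names and the representation of a product distribution as a list of per-coordinate Bernoulli parameters are assumptions made for this illustration. The sketch also checks the standard identity $S_{\mathrm{stat}}(P,Q) = 1 - d_{\mathrm{TV}}(P,Q)$, where $d_{\mathrm{TV}}$ is the total variation distance.

```python
from itertools import product
from math import prod


def product_prob(params, x):
    """Probability of bit-vector x when coordinate i is Bernoulli(params[i])."""
    return prod(p if bit else 1.0 - p for p, bit in zip(params, x))


def stat_similarity_bruteforce(p_params, q_params):
    """Sum of min(P(x), Q(x)) over all x in {0,1}^n (exponential in n)."""
    n = len(p_params)
    return sum(
        min(product_prob(p_params, x), product_prob(q_params, x))
        for x in product((0, 1), repeat=n)
    )


def total_variation_bruteforce(p_params, q_params):
    """Half the L1 distance between P and Q, again by full enumeration."""
    n = len(p_params)
    return 0.5 * sum(
        abs(product_prob(p_params, x) - product_prob(q_params, x))
        for x in product((0, 1), repeat=n)
    )


if __name__ == "__main__":
    # Hypothetical example parameters, chosen only for illustration.
    P = [0.5, 0.5]
    Q = [0.9, 0.1]
    s = stat_similarity_bruteforce(P, Q)
    tv = total_variation_bruteforce(P, Q)
    print(s, 1.0 - tv)  # the two values agree up to floating-point error
```

For the example parameters above, both printed values are approximately 0.44. Because the loop visits every outcome, this approach is only feasible for small $n$, which is the gap an approximation scheme with running time polynomial in $n$ and $1/\varepsilon$ is meant to close.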