🤖 AI Summary
This paper studies property testing and estimation of the average distribution $p_{\text{avg}}$ of $T$ heterogeneous, non-i.i.d. distributions over a discrete support of size $k$, where only $c$ samples are available per distribution. We characterize a phase transition in sample complexity: when $c = 1$, $\Omega(k/\varepsilon^2)$ samples are necessary for uniformity and identity testing, i.e., linear in the support size $k$; in contrast, for $c \geq 2$, the sample complexity drops to a sublinear $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$, recovering the efficiency of the i.i.d. setting. Crucially, we show that naive aggregation fails even at $c = 2$: there is a constant $\rho > 0$ such that, with $\rho k$ samples, no tester that ignores which samples were drawn from the same distribution can perform uniformity testing. Technically, our analysis integrates total variation distance bounds, moment estimation, coupling constructions, and information-theoretic lower bounds. We provide tight sample complexity characterizations for uniformity and identity testing, as well as for learning $p_{\text{avg}}$.
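To make the sampling model concrete, here is a minimal, self-contained Python sketch (not from the paper) of the learning setting with $c = 1$: it draws one sample from each of $T$ heterogeneous sources, pools them into an empirical histogram, and measures the TV error against the true average distribution. The particular choice of random sources is a hypothetical instance for illustration.

```python
import random

def empirical_p_avg(dists, c, rng):
    """Pool c draws from each source distribution into one empirical
    histogram; this is the natural estimator of p_avg."""
    k = len(dists[0])
    counts = [0] * k
    for p in dists:
        for _ in range(c):
            counts[rng.choices(range(k), weights=p)[0]] += 1
    total = c * len(dists)
    return [ct / total for ct in counts]

def tv_distance(p, q):
    """Total variation distance: half the l1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

rng = random.Random(0)
k, T, c = 10, 5000, 1
# Hypothetical heterogeneous sources: each p_i is an independent random
# distribution on [k] (normalized uniform weights).
dists = []
for _ in range(T):
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    dists.append([x / s for x in w])

# True average distribution and its empirical estimate from c = 1 draws each.
p_avg = [sum(p[j] for p in dists) / T for j in range(k)]
p_hat = empirical_p_avg(dists, c, rng)
print(round(tv_distance(p_avg, p_hat), 3))
```

Even a single draw per source suffices for learning: the pooled histogram concentrates around $p_{\text{avg}}$ at the usual $\Theta(k/\varepsilon^2)$ rate, which is why learning, unlike testing, shows no phase transition at $c = 1$.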
📝 Abstract
We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $\textbf{p}_1, \textbf{p}_2, \ldots, \textbf{p}_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $\textbf{p}_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance. To test uniformity or identity -- distinguishing the case that $\textbf{p}_{\mathrm{avg}}$ is equal to some reference distribution, versus having $\ell_1$ distance at least $\varepsilon$ from the reference distribution -- we show that a number of samples linear in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear-sample testing of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho>0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $\textbf{p}_i$) can perform uniformity testing.
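The $c=1$ testing lower bound can be illustrated by a standard indistinguishability construction, sketched below in Python. This is consistent with the abstract's claim but is not necessarily the paper's exact proof instance: with one draw per source, an instance where every $\textbf{p}_i$ is uniform and an instance where each $\textbf{p}_i$ is a point mass at an independent uniform location yield identically distributed samples, yet when $T < k$ the second instance's average distribution is far from uniform.

```python
import random

def tv(p, q):
    """Total variation distance (half the l1 distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

rng = random.Random(0)
k, T = 1000, 200          # fewer sources than support elements: T < k

# Instance A: every p_i is uniform over [k], so p_avg is exactly uniform.
uniform = [1.0 / k] * k

# Instance B: each p_i is a point mass at an independent uniform location,
# so p_avg is supported on at most T < k elements.
locs = [rng.randrange(k) for _ in range(T)]
p_avg_B = [0.0] * k
for j in locs:
    p_avg_B[j] += 1.0 / T

print(tv(uniform, uniform))    # 0.0
print(tv(p_avg_B, uniform))    # at least 1 - T/k = 0.8: far from uniform
# With c = 1, the single draw from each source in instance B is just its
# point-mass location, which is itself uniformly distributed, so the observed
# samples have exactly the same law as in instance A.
```

Since the two transcripts are identically distributed while the averages are $\varepsilon$-far apart, any tester with one draw per source must fail on one of the instances; for $c \ge 2$, two draws from the same source reveal repeated point-mass locations, which is the structure the sublinear testers exploit.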