Testing with Non-identically Distributed Samples

📅 2023-11-19
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This paper studies property testing and estimation of the average distribution $p_{\mathrm{avg}}$ of $T$ heterogeneous, non-i.i.d. distributions, where only $c$ samples are available per distribution. It establishes a sharp phase transition in sample complexity: with $c = 1$ samples per distribution, learning $p_{\mathrm{avg}}$ requires $\Theta(k/\varepsilon^2)$ samples, and uniformity and identity testing require a number of samples linear in the support size $k$; in contrast, for $c \geq 2$, the testing sample complexity drops to a sublinear $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$, recovering the efficiency of the i.i.d. setting. Crucially, at $c = 2$ this recovery depends on exploiting which samples share a source: there is a constant $\rho > 0$ such that no tester that sees only the pooled multiset of samples can perform uniformity testing, even with $\rho k$ samples. Technically, the analysis combines total variation distance bounds, moment estimation, coupling constructions, and information-theoretic lower bounds, yielding tight sample complexity characterizations for uniformity and identity testing as well as for learning $p_{\mathrm{avg}}$.
📝 Abstract
We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $\textbf{p}_1, \textbf{p}_2, \ldots, \textbf{p}_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $\textbf{p}_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance. To test uniformity or identity -- distinguishing the case that $\textbf{p}_{\mathrm{avg}}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution -- we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $\rho>0$ such that even in the linear regime with $\rho k$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $\textbf{p}_i$) can perform uniformity testing.
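As a concrete illustration of the sampling framework described in the abstract (not the paper's own algorithm), the sketch below draws $c$ samples from each of $T$ distributions, pools them into a multiset, and applies the classic i.i.d. collision statistic for uniformity: under a uniform $p_{\mathrm{avg}}$ the pairwise collision rate is about $1/k$, while a far-from-uniform $p_{\mathrm{avg}}$ inflates it. All names and parameter values are illustrative; the paper's lower bound shows that at $c=2$ such pooled-multiset testers are in general insufficient, which is why this is a model sketch rather than a faithful tester.

```python
import random

def draw_samples(dists, c, rng):
    """Draw c independent samples from each of the T distributions.

    dists: list of T probability vectors over a support of size k.
    Returns a list of T tuples, one per p_i, so a tester may use
    (or deliberately ignore) which distribution produced each draw.
    """
    return [tuple(rng.choices(range(len(p)), weights=p, k=c)) for p in dists]

def collision_rate(samples):
    """Pairwise collision rate on the pooled multiset of samples.

    For i.i.d. draws from a uniform distribution over k symbols, the
    expected rate is 1/k; a markedly higher rate is evidence that the
    pooled samples are not uniform. This ignores sample origin entirely.
    """
    pooled = [x for group in samples for x in group]
    n = len(pooled)
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    collisions = sum(v * (v - 1) // 2 for v in counts.values())
    return collisions / (n * (n - 1) // 2)

rng = random.Random(0)
k, T, c = 50, 2000, 2  # illustrative sizes, not the paper's parameters

# Case 1: every p_i uniform, so p_avg is uniform over all k symbols.
uniform = [[1.0 / k] * k for _ in range(T)]
# Case 2: every p_i uniform on the first k/2 symbols, so p_avg has
# l1 distance 1 from uniform -- far for any reasonable epsilon.
far = [[2.0 / k if j < k // 2 else 0.0 for j in range(k)] for _ in range(T)]

rate_uniform = collision_rate(draw_samples(uniform, c, rng))
rate_far = collision_rate(draw_samples(far, c, rng))
```

With these sizes, `rate_uniform` concentrates near $1/k = 0.02$ while `rate_far` concentrates near $2/k$, so the pooled statistic separates the two cases here; the paper's $c=2$ hardness result exhibits subtler instances where no multiset statistic can.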
Problem

Research questions and friction points this paper is trying to address.

Testing properties of average distributions with non-identically distributed samples
Determining sample complexity for uniformity and identity testing under heterogeneity
Analyzing limitations of testers ignoring sample origin in distribution testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Testing with non-identically distributed samples
Using multiple samples from each distribution
Achieving sublinear sample complexity guarantees