Distribution Testing in the Presence of Arbitrarily Dominant Noise with Verification Queries

📅 2025-09-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work studies distribution testing when the tester has no direct access to relevant data: it only sees samples from a mixture in which a small fraction $\lambda$ comes from the target distribution and the rest is arbitrary noise. To make testing possible, the authors introduce a "verification query" model, in which the tester can ask whether a given sample was generated by the target distribution or by the noise. The analysis establishes smooth trade-offs between sample complexity and query complexity, with matching upper and lower bounds for uniformity, identity, and closeness testing. The uniformity testers remain valid when the contaminated samples are produced by an adaptive adversary, at the cost of a $\log n$ factor. Finally, the lower bounds can be circumvented when the algorithm is given the probability density function (PDF) of the mixture.

📝 Abstract

We study distribution testing without direct access to a source of relevant data, but rather to one where only a tiny fraction is relevant. To enable this, we introduce the following verification query model. The goal is to perform a statistical task on distribution $\boldsymbol{p}$ given sample access to a mixture $\boldsymbol{r} = \lambda\boldsymbol{p} + (1-\lambda)\boldsymbol{q}$ and the ability to query whether a sample was generated by $\boldsymbol{p}$ or by $\boldsymbol{q}$. In general, if $m_0$ samples from $\boldsymbol{p}$ suffice for a task, then $O(m_0/\lambda)$ samples and queries always suffice in our model. Are there tasks for which the number of queries can be significantly reduced? We study the canonical problems in distribution testing, and obtain matching upper and lower bounds that reveal smooth trade-offs between sample and query complexity. For all $m \leq n$, we obtain (i) a uniformity and identity tester using $O(m + \frac{\sqrt{n}}{\varepsilon^2 \lambda})$ samples and $O(\frac{n}{m \varepsilon^4 \lambda^2})$ queries, and (ii) a closeness tester using $O(m + \frac{n^{2/3}}{\varepsilon^{4/3} \lambda} + \frac{1}{\varepsilon^4 \lambda^3})$ samples and $O(\frac{n^2}{m^2 \varepsilon^4 \lambda^3})$ queries. Moreover, we show that these query complexities are tight for all testers using $m \ll n$ samples. Next, we show that for testing closeness using $m = \widetilde{O}(\frac{n}{\varepsilon^2\lambda})$ samples we can achieve query complexity $\widetilde{O}(\frac{1}{\varepsilon^2\lambda})$, which is nearly optimal even for the basic task of bias estimation with unbounded samples. Our uniformity testers work in the more challenging setting where the contaminated samples are generated by an adaptive adversary (at the cost of a $\log n$ factor). Finally, we show that our lower bounds can be circumvented if the algorithm is provided with the PDF of the mixture.
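The baseline mentioned in the abstract (if $m_0$ clean samples suffice, then $O(m_0/\lambda)$ samples and queries suffice) can be illustrated with a small simulation. The sketch below is not the paper's tester: it simply queries every drawn sample and keeps those attributed to the target distribution. The function names, the simulated oracle (a hidden label returned alongside each sample), and the example distributions are all assumptions made for illustration.

```python
import random

def sample_mixture(p, q, lam, rng):
    """Draw one sample from r = lam*p + (1-lam)*q.

    Returns (value, from_p). The from_p flag plays the role of the hidden
    label that a verification query would reveal (simulation assumption).
    """
    from_p = rng.random() < lam
    dist = p if from_p else q
    value = rng.choices(range(len(dist)), weights=dist, k=1)[0]
    return value, from_p

def collect_verified(p, q, lam, m0, rng):
    """Naive baseline: query every sample, keep those generated by p."""
    kept, queries = [], 0
    while len(kept) < m0:
        value, hidden_label = sample_mixture(p, q, lam, rng)
        queries += 1              # one verification query per drawn sample
        if hidden_label:          # oracle says "generated by p"
            kept.append(value)
    return kept, queries

rng = random.Random(0)
n = 10
p = [1.0 / n] * n                 # target: uniform over n elements
q = [1.0] + [0.0] * (n - 1)       # noise: point mass on element 0
kept, queries = collect_verified(p, q, lam=0.1, m0=200, rng=rng)
print(len(kept), queries)         # queries is about m0 / lam in expectation
```

Because each draw comes from $\boldsymbol{p}$ with probability $\lambda$, the expected number of draws and queries is $m_0/\lambda$. The kept samples can then be passed to any standard tester for $\boldsymbol{p}$. The testers described in the abstract aim to use far fewer queries than this baseline.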

Problem

Research questions and friction points this paper is trying to address.

Testing distributions with limited access to relevant data samples

Developing verification queries to identify noise-contaminated samples

Establishing trade-offs between sample size and query complexity
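
As a rough sanity check of the sample/query trade-off, the uniformity and identity query bound from the abstract can be evaluated at its two extremes. Constants and logarithmic factors are ignored, and $Q(m)$ is shorthand introduced here, not notation from the paper:

$$
Q(m) = \frac{n}{m\,\varepsilon^4 \lambda^2}, \qquad
Q\!\left(\frac{\sqrt{n}}{\varepsilon^2 \lambda}\right) = \frac{n\,\varepsilon^2 \lambda}{\sqrt{n}\,\varepsilon^4 \lambda^2} = \frac{\sqrt{n}}{\varepsilon^2 \lambda}, \qquad
Q(n) = \frac{1}{\varepsilon^4 \lambda^2}.
$$

At $m = \sqrt{n}/(\varepsilon^2 \lambda)$ (assuming this is at most $n$), the query count equals the baseline $m_0/\lambda$ with $m_0 = \sqrt{n}/\varepsilon^2$, the classical sample complexity of uniformity testing. At $m = n$, the query count no longer depends on $n$.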

Innovation

Methods, ideas, or system contributions that make the work stand out.

Verification queries identify samples from target distribution

Testers trade samples for queries along a smooth curve, with matching lower bounds

Handles adversarial noise contamination in distribution testing

🔎 Similar Papers

A New Upper Bound for Distributed Hypothesis Testing Using the Auxiliary Receiver Approach