AI Summary
This paper studies the distribution clustering problem: given $k$ unknown distributions partitioned into two clusters, one of size $r$, where the distributions within each cluster are identical and the two cluster distributions are $\varepsilon$-far in total variation, the goal is to exactly recover this bipartition. We provide the first systematic characterization of the sample complexity as a function of the domain size $n$, the number of distributions $k$, the cluster size $r$, and the separation $\varepsilon$. In both the known- and unknown-distribution settings, we establish upper and lower bounds that match up to a multiplicative $O(\log k)$ factor. Our approach integrates techniques from distribution testing and statistical learning theory with a refined sample complexity analysis. The results yield a near-optimal characterization of cluster recoverability across all parameter regimes, establishing a sample complexity benchmark for structural learning of discrete distributions.
Abstract
We study the following distribution clustering problem: given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are identical and the two distributions associated with the two clusters are $\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when one of the clusters' distributions is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity's dependence on the domain size $n$, the number of distributions $k$, the size $r$ of one of the clusters, and the distance $\varepsilon$. In particular, we achieve tightness with respect to $(n,k,r,\varepsilon)$ (up to an $O(\log k)$ factor) in all regimes.
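To make the setup concrete, here is a minimal plug-in baseline, an illustration rather than the paper's algorithm: learn each distribution empirically and threshold pairwise TV distances at $\varepsilon/2$. The function names `empirical_tv` and `cluster_by_tv` and the sample sizes are assumptions of this sketch; this naive approach uses roughly $O(n/\varepsilon^2)$ samples per distribution, whereas the paper's results pin down the optimal dependence on $(n,k,r,\varepsilon)$.

```python
import numpy as np

def empirical_tv(samples_a, samples_b, n):
    """Total variation distance between the empirical distributions of
    two sample sets over the finite domain {0, ..., n-1}."""
    p = np.bincount(samples_a, minlength=n) / len(samples_a)
    q = np.bincount(samples_b, minlength=n) / len(samples_b)
    return 0.5 * np.abs(p - q).sum()

def cluster_by_tv(sample_sets, n, eps):
    """Place each distribution in the same cluster as the first one iff
    the empirical TV distance between them falls below eps / 2. With
    roughly O(n / eps^2) samples per distribution, each estimate is
    accurate to within eps / 4 with high probability, so this threshold
    separates the two clusters."""
    ref = sample_sets[0]
    return [0 if empirical_tv(ref, s, n) < eps / 2 else 1
            for s in sample_sets]

# Toy usage: two clusters of distributions over a domain of size n = 10.
rng = np.random.default_rng(0)
p = np.full(10, 0.1)                      # uniform
q = np.array([0.19] * 5 + [0.01] * 5)     # TV(p, q) = 0.45
sets = [rng.choice(10, size=5000, p=d) for d in (p, p, q, q)]
print(cluster_by_tv(sets, n=10, eps=0.4))  # expected: [0, 0, 1, 1]
```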