AI Summary
This paper studies the distribution clustering problem: given $k$ unknown distributions partitioned into two clusters, one of size $r$, where the distributions within each cluster are identical and the two cluster distributions are $\varepsilon$-far in total variation, the goal is to exactly recover this bipartition. We provide the first systematic characterization of the sample complexity as a function of the domain size $n$, the number of distributions $k$, the cluster size $r$, and the separation $\varepsilon$. In both the known- and unknown-distribution settings, we establish upper and lower bounds that match up to a multiplicative $O(\log k)$ factor. Our approach integrates techniques from distribution testing and statistical learning theory with a refined sample complexity analysis. The results yield a near-optimal characterization of cluster recoverability across all parameter regimes, establishing a sample complexity benchmark for structural learning of discrete distributions.
Abstract
We study the following distribution clustering problem: given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are identical and the two distributions associated with the two clusters are $\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when one of the clusters' distributions is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity's dependence on the domain size $n$, the number of distributions $k$, the size $r$ of one of the clusters, and the distance $\varepsilon$. In particular, we achieve tightness with respect to $(n,k,r,\varepsilon)$ (up to an $O(\log k)$ factor) in all regimes.
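To make the setup concrete, here is a minimal plug-in baseline, an illustration rather than the paper's algorithm: learn each distribution empirically and threshold pairwise TV distances at $\varepsilon/2$. The function names `empirical_tv` and `cluster_by_tv` and the sample sizes are assumptions of this sketch; this naive approach uses roughly $O(n/\varepsilon^2)$ samples per distribution, whereas the paper's results pin down the optimal dependence on $(n,k,r,\varepsilon)$.

```python
import numpy as np

def empirical_tv(samples_a, samples_b, n):
    """Total variation distance between the empirical distributions of
    two sample sets over the finite domain {0, ..., n-1}."""
    p = np.bincount(samples_a, minlength=n) / len(samples_a)
    q = np.bincount(samples_b, minlength=n) / len(samples_b)
    return 0.5 * np.abs(p - q).sum()

def cluster_by_tv(sample_sets, n, eps):
    """Place each distribution in the same cluster as the first one iff
    the empirical TV distance between them falls below eps / 2. With
    roughly O(n / eps^2) samples per distribution, each estimate is
    accurate to within eps / 4 with high probability, so this threshold
    separates the two clusters."""
    ref = sample_sets[0]
    return [0 if empirical_tv(ref, s, n) < eps / 2 else 1
            for s in sample_sets]

# Toy usage: two clusters of distributions over a domain of size n = 10.
rng = np.random.default_rng(0)
p = np.full(10, 0.1)                      # uniform
q = np.array([0.19] * 5 + [0.01] * 5)     # TV(p, q) = 0.45
sets = [rng.choice(10, size=5000, p=d) for d in (p, p, q, q)]
print(cluster_by_tv(sets, n=10, eps=0.4))  # expected: [0, 0, 1, 1]
```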