A Distribution Testing Approach to Clustering Distributions

2025-12-09
Citations: 0
Influential: 0
AI Summary
This paper studies the distribution clustering problem: given $k$ unknown distributions partitioned into two clusters, one of size $r$, where the distributions within each cluster are identical and the two cluster distributions are $\varepsilon$-far in total variation distance, the goal is to recover the bipartition. The paper characterizes the sample complexity as a function of domain size $n$, number of distributions $k$, cluster size $r$, and separation $\varepsilon$. For both the setting where one cluster's distribution is known and the setting where both are unknown, it establishes nearly tight upper and lower bounds that match up to a multiplicative $O(\log k)$ factor. The approach draws on techniques from distribution testing. The results characterize cluster recoverability across the entire parameter regime, giving a sample complexity benchmark for this structural learning problem.

Abstract
We study the following distribution clustering problem: Given a hidden partition of $k$ distributions into two groups, such that the distributions within each group are the same, and the two distributions associated with the two clusters are $\varepsilon$-far in total variation, the goal is to recover the partition. We establish upper and lower bounds on the sample complexity for two fundamental cases: (1) when one of the clusters' distributions is known, and (2) when both are unknown. Our upper and lower bounds characterize the sample complexity's dependence on the domain size $n$, number of distributions $k$, size $r$ of one of the clusters, and distance $\varepsilon$. In particular, we achieve tightness with respect to $(n,k,r,\varepsilon)$ (up to an $O(\log k)$ factor) for all regimes.
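
To make case (1) concrete, here is a minimal Python sketch of a toy instance together with a naive baseline. It is not the paper's algorithm and makes no claim about sample optimality: it labels each sample set by thresholding the plug-in total variation distance to the known distribution at $\varepsilon/2$. All names (`empirical_tv_to_known`, `naive_partition`), the uniform choice of the known distribution, and the toy parameter values are assumptions made for illustration.

```python
import numpy as np

def empirical_tv_to_known(samples, p):
    """Plug-in TV distance between the empirical distribution of `samples` and known `p`."""
    counts = np.bincount(samples, minlength=len(p))
    return 0.5 * np.abs(counts / len(samples) - p).sum()

def naive_partition(sample_sets, p, eps):
    """Label a sample set True if it looks far from the known distribution `p`.

    Hypothetical baseline: thresholds the plug-in estimate at eps / 2.
    """
    return [bool(empirical_tv_to_known(s, p) > eps / 2) for s in sample_sets]

# Toy instance (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, k, r, eps, m = 10, 8, 3, 0.3, 2000  # domain size, distributions, cluster size, distance, samples each

p = np.full(n, 1.0 / n)                # known cluster distribution: uniform on {0, ..., n-1}
q = p.copy()                           # other cluster distribution, at TV distance eps from p
q[: n // 2] += eps / (n // 2)          # add eps mass to the first half of the domain
q[n // 2 :] -= eps / (n - n // 2)      # remove eps mass from the second half

# Hidden partition: r of the k distributions equal q, the rest equal p.
truth = [False] * k
for i in rng.choice(k, size=r, replace=False):
    truth[i] = True

sample_sets = [rng.choice(n, size=m, p=(q if far else p)) for far in truth]
recovered = naive_partition(sample_sets, p, eps)
print(recovered == truth)
```

The plug-in estimate is biased upward when the number of samples is small relative to the domain size $n$, so this baseline needs far more samples than necessary on large domains. It is meant only to show the shape of the problem.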
Problem

Research questions and friction points this paper is trying to address.

Clustering distributions into two groups based on hidden partitions
Determining sample complexity for known and unknown cluster distributions
Establishing tight bounds on the sample complexity in terms of domain size, number of distributions, cluster size, and distance
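
The problem described in these points can be written formally along the following lines. This is one plausible formalization based on the abstract. The symbols $p$, $q$, $p_i$, and $S$ are notation introduced here for illustration, and the paper's precise definitions may differ.

```latex
% Notation (p, q, p_i, S) is assumed for illustration, not taken from the paper.
\[
  p_1, \dots, p_k \ \text{distributions over } [n], \qquad
  S \subseteq [k], \quad |S| = r, \qquad
  p_i =
  \begin{cases}
    q, & i \in S, \\
    p, & i \notin S,
  \end{cases}
\]
\[
  d_{\mathrm{TV}}(p, q) \;=\; \frac{1}{2} \sum_{x \in [n]} \lvert p(x) - q(x) \rvert \;\ge\; \varepsilon .
\]
% Goal: given sample access to each p_i, output the hidden set S.
% Case (1): one of p, q is known to the algorithm. Case (2): both are unknown.
```
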
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution clustering via statistical testing
Sample complexity bounds for known and unknown distributions
Tight characterization (up to an $O(\log k)$ factor) in terms of domain size, number of distributions, cluster size, and distance