🤖 AI Summary
This work addresses sublinear-time testing of hierarchical clustering structure and approximation of the Dasgupta cost for $(k,\varepsilon)$-clusterable graphs. For cluster-separable graphs, we propose the first sublinear-time algorithm that needs only about $n^{1/3}$ randomly sampled labeled seed vertices to test hierarchical clustering structure and achieve an $O(\sqrt{\log k})$-approximation to the Dasgupta cost. Our method combines local graph exploration, random sampling, and cluster label propagation with a lightweight simulation of the Charikar–Chatziafratis algorithm, and runs in $\tilde{O}(n^{1/2+O(\varepsilon)})$ time, significantly improving upon the $O(n^2)$ complexity of brute-force approaches. Experiments confirm the algorithm’s efficiency and robustness on cluster-separable graphs, overcoming a longstanding bottleneck in sublinear characterization of inter-cluster connectivity structure.
📝 Abstract
Testing graph cluster structure has been a central object of study in property testing since the foundational work of Goldreich and Ron [STOC'96] on expansion testing, i.e. the problem of distinguishing between a single cluster (an expander) and a graph that is far from a single cluster. More generally, a $(k, \epsilon)$-clusterable graph $G$ is a graph whose vertex set admits a partition into $k$ induced expanders, each with outer conductance bounded by $\epsilon$. A recent line of work initiated by Czumaj, Peng and Sohler [STOC'15] has shown how to test whether a graph is close to $(k, \epsilon)$-clusterable, and to locally determine which cluster a given vertex belongs to with misclassification rate $\approx \epsilon$, but no sublinear time algorithms for learning the structure of inter-cluster connections are known. As a simple example, can one locally distinguish between the `cluster graph' forming a line and a clique? In this paper, we consider the problem of testing the hierarchical cluster structure of $(k, \epsilon)$-clusterable graphs in sublinear time. Our measure of hierarchical clusterability is the well-established Dasgupta cost, and our main result is an algorithm that approximates Dasgupta cost of a $(k, \epsilon)$-clusterable graph in sublinear time, using a small number of randomly chosen seed vertices for which cluster labels are known. Our main result is an $O(\sqrt{\log k})$ approximation to Dasgupta cost of $G$ in $\approx n^{1/2+O(\epsilon)}$ time using $\approx n^{1/3}$ seeds, effectively giving a sublinear time simulation of the algorithm of Charikar and Chatziafratis [SODA'17] on clusterable graphs. To the best of our knowledge, ours is the first result on approximating the hierarchical clustering properties of such graphs in sublinear time.
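For readers unfamiliar with the measure, the Dasgupta cost of a hierarchy $T$ over a weighted graph charges each edge $\{u, v\}$ its weight times the number of leaves under the lowest common ancestor of $u$ and $v$ in $T$. The following is a minimal brute-force sketch of that definition, not the sublinear algorithm described in the abstract. The nested-tuple tree encoding and the names `leaves` and `dasgupta_cost` are assumptions made for illustration only.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaves of a nested-tuple tree (a leaf is any non-tuple)."""
    if not isinstance(tree, tuple):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def dasgupta_cost(tree, weights):
    """Sum over edges {u, v} of w(u, v) * (number of leaves under lca(u, v)).

    `weights` maps frozenset({u, v}) to the edge weight (hypothetical encoding).
    """
    if not isinstance(tree, tuple):
        return 0
    child_leaves = [leaves(child) for child in tree]
    size = sum(len(group) for group in child_leaves)
    # Edges inside a single child are charged deeper in the tree.
    cost = sum(dasgupta_cost(child, weights) for child in tree)
    # Edges crossing two different children have their LCA at this node.
    for group_a, group_b in combinations(child_leaves, 2):
        for u in group_a:
            for v in group_b:
                cost += size * weights.get(frozenset((u, v)), 0)
    return cost


# Two toy graphs on 4 vertices with unit weights: a line and a clique.
line = {frozenset((i, i + 1)): 1 for i in range(3)}
clique = {frozenset(pair): 1 for pair in combinations(range(4), 2)}

balanced = ((0, 1), (2, 3))
caterpillar = (((0, 1), 2), 3)

print(dasgupta_cost(balanced, line), dasgupta_cost(caterpillar, line))      # 8 9
print(dasgupta_cost(balanced, clique), dasgupta_cost(caterpillar, clique))  # 20 20
```

On the line, the balanced tree is cheaper than the caterpillar tree, whereas on the clique every binary tree has the same cost. This illustrates why the cost of a good hierarchy carries information about how the pieces are connected, in the spirit of the line-versus-clique question posed in the abstract.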