Approximating Dasgupta Cost in Sublinear Time from a Few Random Seeds

📅 2022-07-06

📈 Citations: 4

✨ Influential: 1

🤖 AI Summary

This work addresses sublinear-time testing of hierarchical clustering structure and approximation of the Dasgupta cost for $(k,\varepsilon)$-clusterable graphs. For such graphs, we propose the first sublinear-time algorithm, which needs only $O(n^{1/3})$ randomly sampled seed vertices with known cluster labels to test hierarchical clustering structure and achieve an $O(\sqrt{\log k})$-approximation to the Dasgupta cost. The method combines local graph exploration, random sampling, and cluster label propagation, and simulates the Charikar–Chatziafratis algorithm cheaply enough to run in $\tilde{O}(n^{1/2+O(\varepsilon)})$ time, significantly improving upon the $O(n^2)$ complexity of brute-force approaches. Experiments confirm the algorithm's efficiency and robustness on cluster-separable graphs, overcoming a longstanding bottleneck in sublinear characterization of inter-cluster connectivity structure.

📝 Abstract

Testing graph cluster structure has been a central object of study in property testing since the foundational work of Goldreich and Ron [STOC'96] on expansion testing, i.e. the problem of distinguishing between a single cluster (an expander) and a graph that is far from a single cluster. More generally, a $(k, \epsilon)$-clusterable graph $G$ is a graph whose vertex set admits a partition into $k$ induced expanders, each with outer conductance bounded by $\epsilon$. A recent line of work initiated by Czumaj, Peng and Sohler [STOC'15] has shown how to test whether a graph is close to $(k, \epsilon)$-clusterable, and to locally determine which cluster a given vertex belongs to with misclassification rate $\approx \epsilon$, but no sublinear time algorithms for learning the structure of inter-cluster connections are known. As a simple example, can one locally distinguish between the 'cluster graph' forming a line and a clique? In this paper, we consider the problem of testing the hierarchical cluster structure of $(k, \epsilon)$-clusterable graphs in sublinear time. Our measure of hierarchical clusterability is the well-established Dasgupta cost, and our main result is an algorithm that approximates Dasgupta cost of a $(k, \epsilon)$-clusterable graph in sublinear time, using a small number of randomly chosen seed vertices for which cluster labels are known. Our main result is an $O(\sqrt{\log k})$ approximation to Dasgupta cost of $G$ in $\approx n^{1/2+O(\epsilon)}$ time using $\approx n^{1/3}$ seeds, effectively giving a sublinear time simulation of the algorithm of Charikar and Chatziafratis [SODA'17] on clusterable graphs. To the best of our knowledge, ours is the first result on approximating the hierarchical clustering properties of such graphs in sublinear time.
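To make the objective concrete: the Dasgupta cost of a hierarchy $T$ over a weighted graph is the sum, over all edges $(u, v)$, of the edge weight times the number of leaves under the lowest common ancestor of $u$ and $v$ in $T$. The sketch below computes this exactly, by brute force, on two toy four-vertex graphs (a line and a clique). It is only an illustration of the cost function and is not the paper's sublinear algorithm. The names `leaves` and `dasgupta_cost`, the nested-tuple tree encoding, and the toy graphs are all assumptions made for this example.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaves of a nested-tuple tree (any non-tuple is a leaf)."""
    if not isinstance(tree, tuple):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def dasgupta_cost(tree, weights):
    """Dasgupta cost of `tree` for edge weights {frozenset({u, v}): w}.

    An edge is charged at the internal node where its endpoints first fall
    into different children; the charge is w times the leaf count there.
    """
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(child) for child in tree]
    size = sum(len(part) for part in child_leaves)
    split_weight = 0.0
    for part_a, part_b in combinations(child_leaves, 2):
        for u in part_a:
            for v in part_b:
                split_weight += weights.get(frozenset((u, v)), 0.0)
    return size * split_weight + sum(dasgupta_cost(c, weights) for c in tree)


# Toy inputs (hypothetical): a unit-weight line 0-1-2-3 and a clique on 4 vertices.
line = {frozenset((i, i + 1)): 1.0 for i in range(3)}
clique = {frozenset(pair): 1.0 for pair in combinations(range(4), 2)}

balanced = ((0, 1), (2, 3))
caterpillar = (0, (1, (2, 3)))

print(dasgupta_cost(balanced, line))       # 8.0
print(dasgupta_cost(caterpillar, line))    # 9.0
print(dasgupta_cost(balanced, clique))     # 20.0
print(dasgupta_cost(caterpillar, clique))  # 20.0
```

On the line, the balanced tree is cheaper than the caterpillar tree, while on the unit-weight clique both trees cost the same. This is the sense in which the cost reflects how the pieces of a graph are connected to one another.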

Problem

Research questions and friction points this paper is trying to address.

Approximating Dasgupta cost in sublinear time

Testing hierarchical cluster structure of graphs

Using few random seeds for cluster labels

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sublinear time algorithm

Approximates Dasgupta cost

Uses random seed vertices
