🤖 AI Summary
This paper addresses the problem of efficiently estimating the total cost of single-linkage hierarchical clustering—i.e., approximating the sum of costs $\mathrm{cost}_k$ over all $k$, denoted $\mathrm{cost}(G)$, for a weighted graph (or metric space)—in sublinear time. We propose the first sampling-based algorithm under the adjacency-list query model, leveraging structural properties of minimum and maximum spanning trees. Our method achieves a $(1\pm\varepsilon)$-approximation to $\mathrm{cost}(G)$ in $\widetilde{O}(d\sqrt{W}/\varepsilon^3)$ time, with additive error at most $\varepsilon \cdot \mathrm{cost}(G)$. Moreover, it simultaneously provides $(1+\varepsilon)$-approximations to each $\mathrm{cost}_k$ in an average-case sense. This is the first algorithm to approximate the total single-linkage hierarchical clustering cost, and its time complexity nearly matches the theoretical lower bound—significantly improving upon the naive $O(n^2)$ approach—making it well-suited for large-scale distance or similarity graph analysis.
📝 Abstract
Single-linkage clustering is a fundamental method for data analysis. Algorithmically, one can compute a single-linkage $k$-clustering (a partition into $k$ clusters) by computing a minimum spanning tree and dropping the $k-1$ most costly edges. This clustering minimizes the sum of spanning tree weights of the clusters, which motivates us to define the cost of a single-linkage $k$-clustering as the weight of the corresponding spanning forest, denoted by $\mathrm{cost}_k$. Moreover, if we view single-linkage clustering as computing a hierarchy of clusterings, the total cost of the hierarchy is defined as the sum of the costs of the individual clusterings, denoted by $\mathrm{cost}(G) = \sum_{k=1}^{n} \mathrm{cost}_k$.
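To make the cost definitions concrete, here is a minimal sketch of the exact (non-sublinear) computation the abstract describes: build a minimum spanning tree, then obtain $\mathrm{cost}_k$ by dropping the $k-1$ heaviest MST edges, and sum over all $k$ for $\mathrm{cost}(G)$. The graph, function names, and Kruskal-based MST routine are illustrative choices, not taken from the paper.

```python
def mst_edge_weights(n, edges):
    """Kruskal's algorithm: return the weights of MST edges, ascending.

    `edges` is a list of (weight, u, v) triples over vertices 0..n-1.
    """
    parent = list(range(n))

    def find(x):
        # Union-find with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    weights = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            weights.append(w)
    return weights


def clustering_costs(n, edges):
    """Return [cost_1, ..., cost_n]: cost_k is the MST weight minus
    the (k-1) heaviest MST edges, i.e. the weight of the spanning
    forest with k components."""
    w = mst_edge_weights(n, edges)
    remaining = sum(w)
    costs = []
    for _ in range(n):
        costs.append(remaining)
        if w:
            remaining -= w.pop()  # drop the heaviest remaining edge
    return costs


# Example: path graph 0-1-2-3 with edge weights 1, 5, 2.
edges = [(1, 0, 1), (5, 1, 2), (2, 2, 3)]
costs = clustering_costs(4, edges)
# cost_1 = 8 (full MST), cost_2 = 3 (weight-5 edge dropped),
# cost_3 = 1, cost_4 = 0, so cost(G) = 12.
```

This exact computation already takes time proportional to reading all edges; the point of the paper's sampling-based algorithm is to estimate these quantities without doing so.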
In this paper, we assume that the distances between data points are given as a graph $G$ with average degree $d$ and edge weights from $\{1,\dots, W\}$. Given query access to the adjacency list of $G$, we present a sampling-based algorithm that computes a succinct representation of estimates $\widehat{\mathrm{cost}}_k$ for all $k$. The running time is $\tilde O(d\sqrt{W}/\varepsilon^3)$, and the estimates satisfy $\sum_{k=1}^{n} |\widehat{\mathrm{cost}}_k - \mathrm{cost}_k| \le \varepsilon\cdot \mathrm{cost}(G)$, for any $0<\varepsilon <1$. Thus we can approximate the cost of every $k$-clustering up to a $(1+\varepsilon)$ factor *on average*. In particular, our result ensures that we can estimate $\mathrm{cost}(G)$ up to a factor of $1\pm \varepsilon$ in the same running time.
We also extend our results to the setting where edges represent similarities. In this case, the clusterings are defined by a maximum spanning tree, and our algorithms run in $\tilde{O}(dW/\varepsilon^3)$ time. We further prove nearly matching lower bounds for estimating the total clustering cost, and we extend our algorithms to metric space settings.