Sublinear Algorithms for Estimating Single-Linkage Clustering Costs

📅 2025-10-13
🤖 AI Summary
This paper addresses the problem of estimating, in sublinear time, the cost of single-linkage hierarchical clustering on a weighted graph (or metric space): both the individual costs $\mathrm{cost}_k$ and their sum over all $k$, denoted $\mathrm{cost}(G)$. It proposes a sampling-based algorithm in the adjacency-list query model that leverages structural properties of minimum and maximum spanning trees. For distance graphs with average degree $d$ and edge weights in $\{1,\dots,W\}$, the algorithm runs in $\widetilde{O}(d\sqrt{W}/\varepsilon^3)$ time and returns estimates whose absolute errors, summed over all $k$, are at most $\varepsilon\cdot\mathrm{cost}(G)$. This gives a $(1\pm\varepsilon)$-approximation to $\mathrm{cost}(G)$ and a $(1+\varepsilon)$-approximation to each $\mathrm{cost}_k$ on average. For similarity graphs, the running time is $\widetilde{O}(dW/\varepsilon^3)$. The paper also proves nearly matching lower bounds for estimating the total cost. Because the algorithm avoids computing a spanning tree exactly, which requires reading the entire graph, it is suited to large-scale distance or similarity graph analysis.

📝 Abstract
Single-linkage clustering is a fundamental method for data analysis. Algorithmically, one can compute a single-linkage $k$-clustering (a partition into $k$ clusters) by computing a minimum spanning tree and dropping the $k-1$ most costly edges. This clustering minimizes the sum of spanning tree weights of the clusters. This motivates us to define the cost of a single-linkage $k$-clustering as the weight of the corresponding spanning forest, denoted by $\mathrm{cost}_k$. Moreover, if we consider single-linkage clustering as computing a hierarchy of clusterings, the total cost of the hierarchy is defined as the sum of the costs of the individual clusterings, denoted by $\mathrm{cost}(G) = \sum_{k=1}^{n} \mathrm{cost}_k$. In this paper, we assume that the distances between data points are given as a graph $G$ with average degree $d$ and edge weights from $\{1,\dots,W\}$. Given query access to the adjacency list of $G$, we present a sampling-based algorithm that computes a succinct representation of estimates $\widehat{\mathrm{cost}}_k$ for all $k$. The running time is $\tilde{O}(d\sqrt{W}/\varepsilon^3)$, and the estimates satisfy $\sum_{k=1}^{n} |\widehat{\mathrm{cost}}_k - \mathrm{cost}_k| \le \varepsilon\cdot \mathrm{cost}(G)$, for any $0<\varepsilon<1$. Thus we can approximate the cost of every $k$-clustering up to a $(1+\varepsilon)$ factor on average. In particular, our result ensures that we can estimate $\mathrm{cost}(G)$ up to a factor of $1\pm \varepsilon$ in the same running time. We also extend our results to the setting where edges represent similarities. In this case, the clusterings are defined by a maximum spanning tree, and our algorithms run in $\tilde{O}(dW/\varepsilon^3)$ time. We further prove nearly matching lower bounds for estimating the total clustering cost, and we extend our algorithms to metric space settings.
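To make the definitions of $\mathrm{cost}_k$ and $\mathrm{cost}(G)$ concrete, here is a minimal exact baseline in Python. It is not the paper's sublinear algorithm: it reads every edge, runs Kruskal's algorithm, and takes prefix sums of the sorted spanning tree weights. The function name `single_linkage_costs` and the `(u, v, w)` edge-list format are assumptions made for this illustration.

```python
from typing import List, Tuple


def single_linkage_costs(n: int, edges: List[Tuple[int, int, int]]) -> List[int]:
    """Return [cost_1, ..., cost_n] for a connected graph on vertices 0..n-1.

    edges holds (u, v, w) triples (hypothetical input format).
    """
    parent = list(range(n))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal: scan edges by increasing weight, keep those joining two components.
    mst_weights = []
    for w, u, v in sorted((w, u, v) for u, v, w in edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst_weights.append(w)
    if len(mst_weights) != n - 1:
        raise ValueError("graph must be connected")

    # mst_weights is ascending, so prefix[j] is the weight of the j cheapest tree edges.
    prefix = [0]
    for w in mst_weights:
        prefix.append(prefix[-1] + w)

    # Dropping the k-1 most costly tree edges leaves the n-k cheapest ones.
    return [prefix[n - k] for k in range(1, n + 1)]


# Small example: a 4-cycle with weights 1, 2, 3, 5.
example_edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (0, 3, 5)]
costs = single_linkage_costs(4, example_edges)
total_cost = sum(costs)  # cost(G) is the sum of cost_k over all k
```

In this example the spanning tree uses the edges of weight 1, 2 and 3, so `costs` is `[6, 3, 1, 0]` and `total_cost` is 10.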
Problem

Research questions and friction points this paper is trying to address.

Estimating single-linkage clustering costs efficiently
Developing sublinear algorithms for clustering cost approximation
Providing fast estimates for hierarchical clustering total costs
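As a short worked derivation (our own illustration, assuming $G$ is connected), the total cost can be written directly in terms of the minimum spanning tree edge weights $w_1 \le w_2 \le \dots \le w_{n-1}$. A $k$-clustering keeps the $n-k$ cheapest tree edges, and edge $i$ is kept exactly for $k = 1, \dots, n-i$:

$$\mathrm{cost}_k = \sum_{i=1}^{n-k} w_i, \qquad \mathrm{cost}(G) = \sum_{k=1}^{n} \sum_{i=1}^{n-k} w_i = \sum_{i=1}^{n-1} (n-i)\, w_i .$$

For example, tree weights $1, 2, 3$ on $n = 4$ vertices give $\mathrm{cost}(G) = 3\cdot 1 + 2\cdot 2 + 1\cdot 3 = 10$.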
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sampling-based algorithm for sublinear clustering cost estimation
Computes a succinct representation of the estimates for all $k$ via adjacency-list query access
Achieves sublinear running time $\tilde{O}(d\sqrt{W}/\varepsilon^3)$ that depends on the average degree $d$
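The summary does not describe the sampling procedure itself. As background only, the sketch below shows the classic component-counting idea used in sublinear minimum spanning tree weight estimation: for a connected graph with integer weights in $\{1,\dots,W\}$, the tree weight equals $n - W + \sum_{j=1}^{W-1} c_j$, where $c_j$ is the number of connected components when only edges of weight at most $j$ are kept, and each $c_j$ can be estimated by sampling vertices and averaging $1/|C(u)|$. Whether the paper uses this exact estimator is an assumption; the names `component_size_capped` and `estimate_components` and the `cap` parameter are hypothetical.

```python
import random
from collections import deque
from typing import Dict, List, Optional


def component_size_capped(adj: Dict[int, List[int]], start: int, cap: int) -> Optional[int]:
    """BFS from start; return the component size, or None if it exceeds cap."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if len(seen) > cap:
                    return None  # component too large, stop exploring early
                queue.append(v)
    return len(seen)


def estimate_components(
    adj: Dict[int, List[int]], samples: int, cap: int, rng: random.Random
) -> float:
    """Estimate the number of connected components from sampled vertices.

    Uses the identity: #components = sum over vertices u of 1 / |C(u)|.
    Components larger than cap contribute 0, which underestimates the
    true count by less than n / cap.
    """
    vertices = list(adj)
    total = 0.0
    for _ in range(samples):
        size = component_size_capped(adj, rng.choice(vertices), cap)
        if size is not None:
            total += 1.0 / size
    return len(vertices) * total / samples


# Three components of size two each (adjacency lists of an undirected graph).
pairs_graph = {0: [1], 1: [0], 2: [3], 3: [2], 4: [5], 5: [4]}
estimate = estimate_components(pairs_graph, samples=10, cap=4, rng=random.Random(0))
```

Each BFS touches at most `cap` vertices plus their incident edges, so the work per sample does not grow with the size of the graph. In the example every sampled vertex lies in a component of size two, so the estimate is exactly 3.0 regardless of which vertices are drawn.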
Authors
Pan Peng
School of Computer Science and Technology, University of Science and Technology of China
Christian Sohler
Professor for Algorithmic Data Analysis, University of Cologne
Theoretical Computer Science, Algorithms
Yi Xu
School of Computer Science and Technology, University of Science and Technology of China