🤖 AI Summary
Average-linkage hierarchical agglomerative clustering (HAC) faces fundamental scalability bottlenecks: no known algorithm achieves near-linear sequential running time or efficient parallelization on general inputs.
Method: We conduct a systematic fine-grained complexity analysis grounded in computational hardness assumptions—specifically the Orthogonal Vectors (OV) hypothesis—and structural graph theory.
Contribution/Results: We establish a conditional lower bound of $n^{3/2-\epsilon}$ for sequential combinatorial average-linkage HAC algorithms on $n$-node graphs under the OV hypothesis, essentially matching the best-known running time. On the parallel side, we show that average-linkage HAC is CC-hard even on trees of diameter 4, so it is unlikely to admit an NC algorithm (polylogarithmic time with polynomially many processors) even on very simple inputs. On the positive side, average-linkage HAC is in NC on paths and can be solved in near-linear time when the height of the output cluster hierarchy is small. Together, these results give a rigorous complexity-theoretic explanation for the long-standing scalability limitations of average-linkage HAC.
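For context, the Orthogonal Vectors hypothesis underlying the sequential lower bound can be stated as follows (this is the standard formulation from fine-grained complexity, not quoted from the paper itself):

```latex
% Orthogonal Vectors (OV): the problem and the hardness hypothesis.
\textbf{Orthogonal Vectors.} Given sets $A, B \subseteq \{0,1\}^{d}$ with
$|A| = |B| = n$ and $d = \omega(\log n)$, decide whether there exist
$a \in A$ and $b \in B$ with $\langle a, b \rangle = 0$.

\textbf{OV hypothesis.} For every constant $\epsilon > 0$, no (randomized)
algorithm solves Orthogonal Vectors in $O(n^{2-\epsilon})$ time.
```

A lower bound "under OV" means that a faster algorithm for the problem at hand would refute this hypothesis (and, via known reductions, the Strong Exponential Time Hypothesis).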
📝 Abstract
Average linkage Hierarchical Agglomerative Clustering (HAC) is an extensively studied and applied method for hierarchical clustering. Recent applications to massive datasets have driven significant interest in near-linear-time and efficient parallel algorithms for average linkage HAC. We provide hardness results that rule out such algorithms. On the sequential side, we establish a runtime lower bound of $n^{3/2-\epsilon}$ on $n$-node graphs for sequential combinatorial algorithms under standard fine-grained complexity assumptions. This essentially matches the best-known running time for average linkage HAC. On the parallel side, we prove that average linkage HAC likely cannot be parallelized even on simple graphs by showing that it is CC-hard on trees of diameter $4$. On the possibility side, we demonstrate that average linkage HAC can be efficiently parallelized (i.e., it is in NC) on paths and can be solved in near-linear time when the height of the output cluster hierarchy is small.
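To make the object of study concrete, here is a minimal sketch of average-linkage HAC on a weighted similarity graph: repeatedly merge the pair of clusters with the highest average inter-cluster edge weight. This naive baseline runs in roughly cubic time; the paper's lower bound concerns how far below that one can go. All names and the graph representation here are illustrative, not taken from the paper.

```python
def average_linkage_hac(n, edges):
    """Naive average-linkage HAC on a graph.

    n: number of nodes (labeled 0..n-1).
    edges: dict mapping frozenset({u, v}) -> similarity weight.
    Returns the merge sequence as a list of (cluster_a, cluster_b) tuples.
    """
    clusters = [frozenset([i]) for i in range(n)]

    def avg_sim(a, b):
        # Average linkage: total inter-cluster weight divided by |a| * |b|.
        total = sum(edges.get(frozenset({u, v}), 0.0) for u in a for v in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        # Greedily pick the pair of clusters with highest average similarity.
        a, b = max(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: avg_sim(*pair),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
        merges.append((set(a), set(b)))
    return merges
```

On the path 0–1–2 with weights 1.0 and 2.0, the heavier edge is merged first, then the remaining two clusters are joined, producing a hierarchy of height 2.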