Spectral Clustering in Birthday Paradox Time

📅 2026-01-09

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem of quickly determining which cluster a given vertex belongs to in a $(k, \varphi, \epsilon)$-clusterable graph. The authors represent each vertex by a mixture of logarithmic-length random walks and combine this representation with a fast nearest neighbor search over $k$ vertices representing the clusters. The representation uses $\approx (n/k)^{1/2 + O(\epsilon/\varphi^2)}$ walks per vertex, the number suggested by the birthday paradox for clusters of size $\approx n/k$, so the query complexity decreases as the number of clusters $k$ grows. This resolves a conceptual inconsistency in prior work, whose bounds of $\approx \text{poly}(k)\cdot n^{1/2+O(\epsilon/\varphi^2)}$ random walk samples grow with $k$. The resulting clustering oracle has query time $\approx (n/k)^{1/2 + O(\epsilon/\varphi^2)}$ and space complexity $k \cdot (n/k)^{1/2 + O(\epsilon/\varphi^2)}$, matching the birthday paradox bound.
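
The birthday paradox heuristic behind these bounds can be sketched as a back-of-the-envelope calculation. This is an idealized illustration, not the paper's analysis: it assumes walk endpoints are exactly uniform over a cluster of size $m = n/k$, and it ignores the $O(\cdot)$ correction in the exponent. The symbols $m$, $s$, $a_i$, $b_j$ are introduced here for illustration only. If two vertices in the same cluster each draw $s$ independent endpoints $a_1,\dots,a_s$ and $b_1,\dots,b_s$, the expected number of colliding pairs is

$$
\mathbb{E}\Big[\sum_{i=1}^{s}\sum_{j=1}^{s} \mathbf{1}[a_i = b_j]\Big] \;=\; \sum_{i=1}^{s}\sum_{j=1}^{s} \Pr[a_i = b_j] \;=\; \frac{s^2}{m} \;=\; \frac{s^2 k}{n},
$$

which reaches a constant only once $s \gtrsim \sqrt{n/k}$. Under the same idealization, storing $s$ samples for each of $k$ cluster representatives takes about $k \cdot \sqrt{n/k}$ space, in line with the stated space bound.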

📝 Abstract

Given a vertex in a $(k, \varphi, \epsilon)$-clusterable graph, i.e. a graph whose vertex set can be partitioned into a disjoint union of $\varphi$-expanders of size $\approx n/k$ with outer conductance bounded by $\epsilon$, can one quickly tell which cluster it belongs to? This question goes back to the expansion testing problem of Goldreich and Ron'11. For $k=2$ a sample of $\approx n^{1/2+O(\epsilon/\varphi^2)}$ logarithmic length walks from a given vertex approximately determines its cluster membership by the birthday paradox: two vertices whose random walk samples are 'close' are likely in the same cluster. The study of the general case $k>2$ was initiated by Czumaj, Peng and Sohler [STOC'15], and the works of Chiplunkar et al. [FOCS'18], Gluch et al. [SODA'21] showed that $\approx \text{poly}(k)\cdot n^{1/2+O(\epsilon/\varphi^2)}$ random walk samples suffice for general $k$. This matches the $k=2$ result up to polynomial factors in $k$, but creates a conceptual inconsistency: if the birthday paradox is the guiding phenomenon, then the query complexity should decrease with the number of clusters $k$! Since clusters have size $\approx n/k$, we expect to need $\approx (n/k)^{1/2+O(\epsilon/\varphi^2)}$ random walk samples, which decreases with $k$. We design a novel representation of vertices in a $(k, \varphi, \epsilon)$-clusterable graph by a mixture of logarithmic length walks. This representation uses the optimal $\approx (n/k)^{1/2+O(\epsilon/\varphi^2)}$ walks per vertex, and allows for a fast nearest neighbor search: given $k$ vertices representing the clusters, we can find the cluster of a given query vertex $x$ using nearly linear time in the representation size of $x$. This gives a clustering oracle with query time $\approx (n/k)^{1/2+O(\epsilon/\varphi^2)}$ and space complexity $k\cdot (n/k)^{1/2+O(\epsilon/\varphi^2)}$, matching the birthday paradox bound.
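
The collision intuition described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's representation or oracle: it is a toy version of the basic birthday-paradox test, run on a hypothetical graph made of $k$ cliques joined in a ring by single edges. The helper names (`make_clustered_graph`, `walk_endpoints`, `collisions`) and all parameter values are assumptions chosen for illustration.

```python
import random
from collections import Counter

def make_clustered_graph(k, size, rng):
    """Toy clusterable graph (an assumption for illustration):
    k cliques of `size` vertices, joined in a ring by single edges."""
    adj = {v: [] for v in range(k * size)}
    for c in range(k):
        members = range(c * size, (c + 1) * size)
        for u in members:
            adj[u].extend(w for w in members if w != u)
    for c in range(k):
        # One sparse edge from cluster c to the next cluster in the ring.
        u = c * size + rng.randrange(size)
        w = ((c + 1) % k) * size + rng.randrange(size)
        adj[u].append(w)
        adj[w].append(u)
    return adj

def walk_endpoints(adj, start, num_walks, length, rng):
    """Run `num_walks` random walks of `length` steps from `start`
    and return a multiset (Counter) of their endpoints."""
    ends = Counter()
    for _ in range(num_walks):
        v = start
        for _ in range(length):
            v = rng.choice(adj[v])
        ends[v] += 1
    return ends

def collisions(a, b):
    """Number of pairs (one sample from a, one from b) that coincide."""
    return sum(cnt * b[v] for v, cnt in a.items())

rng = random.Random(0)
k, size = 4, 50          # n = 200 vertices, clusters of size n/k = 50
num_walks, length = 60, 8  # a few multiples of sqrt(n/k) samples, short walks
adj = make_clustered_graph(k, size, rng)

x = walk_endpoints(adj, 0, num_walks, length, rng)         # query vertex
same = walk_endpoints(adj, 1, num_walks, length, rng)      # same cluster as x
other = walk_endpoints(adj, size, num_walks, length, rng)  # a different cluster

same_hits = collisions(x, same)
other_hits = collisions(x, other)
print("same-cluster collisions:", same_hits)
print("cross-cluster collisions:", other_hits)
```

In this toy setting, endpoint samples from two vertices of the same cluster collide often, while samples from vertices in different clusters almost never do, which is the sense in which 'close' samples indicate shared cluster membership. Comparing a query vertex against one sample set per cluster is a naive stand-in for the nearest neighbor search mentioned in the abstract.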

Problem

Research questions and friction points this paper is trying to address.

Spectral Clustering

Birthday Paradox

Clusterable Graphs

Random Walks

Query Complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral Clustering

Birthday Paradox

Random Walks