Spectral Clustering with Side Information

📅 2025-11-21
🤖 AI Summary
This paper studies graph clustering with noisy vertex labels. The input is an $n$-vertex graph with $k$ planted clusters that induce $\Omega(1)$-expanders and form $\varepsilon$-sparse cuts, and each vertex carries a cluster label that is corrupted independently with probability $\delta$. Approaches that use only the graph structure or only the labels achieve a misclassification rate of $\min\{\varepsilon, \delta\}$. The paper presents a sublinear-time spectral algorithm that combines both sources, built on a new observation about "spectrally ambiguous" vertices, and achieves a nearly optimal misclassification rate of $\approx \widetilde{O}(\varepsilon\delta)$. The clusters output by this classifier do not necessarily induce expanders, so a second, polynomial-time algorithm reweights the edges of the original $(k, \varepsilon, \Omega(1))$-clusterable graph to obtain a $(k, \widetilde{O}(\varepsilon\delta), \Omega(1))$-clusterable one (for constant $k$), preserving the expansion of the clusters.

📝 Abstract
In the graph clustering problem with a planted solution, the input is a graph on $n$ vertices partitioned into $k$ clusters, and the task is to infer the clusters from graph structure. A standard assumption is that clusters induce well-connected subgraphs (i.e. $Ω(1)$-expanders), and form $ε$-sparse cuts. Such a graph defines the clustering uniquely up to $\approx ε$ misclassification rate, and efficient algorithms for achieving this rate are known. While this vanilla version of graph clustering is well studied, in practice, vertices of the graph are typically equipped with labels that provide additional information on cluster ids of the vertices. For example, each vertex could have a cluster label that is corrupted independently with probability $δ$. Using only one of the two sources of information leads to misclassification rate $\min\{ε, δ\}$, but can they be combined to achieve a rate of $\approx εδ$? In this paper, we give an affirmative answer to this question and present a sublinear-time algorithm in the number of vertices $n$. Our key algorithmic insight is a new observation on "spectrally ambiguous" vertices in a well-clusterable graph. While our sublinear-time classifier achieves the nearly optimal $\approx \widetilde{O}(εδ)$ misclassification rate, the approximate clusters that it outputs do not necessarily induce expanders in the graph $G$. In our second result, we give a polynomial-time algorithm that reweights edges of the original $(k, ε, Ω(1))$-clusterable graph to transform it into a $(k, \widetilde{O}(εδ), Ω(1))$-clusterable one (for constant $k$), improving sparsity of cuts nearly optimally and preserving expansion properties of the communities: an algorithm for refining community structure of the input graph.
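The input model in the abstract can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: it uses a stochastic-block-style random graph as a stand-in for a clusterable graph, it corrupts a label by replacing it with a uniformly random wrong cluster id, and it combines the two sources with a naive vote over neighbor labels. None of this is the paper's spectral algorithm, and the names `planted_instance`, `neighbor_vote`, `error_rate`, `p_in`, and `p_out` are hypothetical.

```python
import numpy as np


def planted_instance(n=600, k=3, p_in=0.10, p_out=0.005, delta=0.10, seed=0):
    """Random graph with k planted clusters plus noisy labels (illustrative only)."""
    rng = np.random.default_rng(seed)
    truth = np.repeat(np.arange(k), n // k)  # assumes n is divisible by k

    # Dense edges inside clusters, sparse edges between them.
    same = truth[:, None] == truth[None, :]
    prob = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    adj = (upper | upper.T).astype(np.int64)  # symmetric, no self-loops

    # Assumed corruption model: with probability delta, swap the label
    # for a uniformly random *different* cluster id.
    flip = rng.random(n) < delta
    shift = rng.integers(1, k, size=n)  # values in 1..k-1
    noisy = np.where(flip, (truth + shift) % k, truth)
    return adj, truth, noisy


def neighbor_vote(adj, noisy, k):
    """Naive combination: plurality of noisy labels over a vertex and its neighbors."""
    one_hot = np.eye(k, dtype=np.int64)[noisy]  # n x k indicator matrix
    votes = adj @ one_hot + one_hot  # neighbor votes plus the vertex's own label
    return votes.argmax(axis=1)


def error_rate(pred, truth):
    """Fraction of vertices whose predicted cluster id differs from the truth."""
    return float(np.mean(pred != truth))


if __name__ == "__main__":
    adj, truth, noisy = planted_instance()
    print("labels only  :", error_rate(noisy, truth))
    print("neighbor vote:", error_rate(neighbor_vote(adj, noisy, 3), truth))
```

The labels-only error should sit near `delta`, while the vote benefits from the graph because most of a vertex's neighbors lie in its own cluster. The vote reads the full adjacency matrix, so unlike the algorithm described in the abstract it is not sublinear in $n$.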
Problem

Research questions and friction points this paper is trying to address.

Combining graph structure and noisy labels to improve clustering accuracy
Developing a sublinear-time algorithm for spectral clustering with side information
Reweighting edges to preserve expansion while improving the sparsity of cuts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral clustering with corrupted label side information
Sublinear-time algorithm combining graph and label data
Reweighting edges to refine community structure while preserving expansion
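To make "sparsity of cuts" measurable, the sketch below computes the conductance of a vertex set in a weighted graph, taken here as cut weight divided by the set's volume (an assumed, commonly used definition; the paper's exact normalization may differ). It then shows on a toy graph that lowering the weight of a cross-cluster edge lowers that value. The helper names `conductance` and `two_triangles` are hypothetical, and the edge to downweight is chosen by hand; how the paper decides which edges to reweight is not described in the abstract.

```python
import numpy as np


def conductance(w, members):
    """Cut weight leaving `members` divided by the total weighted degree of `members`."""
    mask = np.zeros(w.shape[0], dtype=bool)
    mask[list(members)] = True
    cut = w[mask][:, ~mask].sum()  # weight of edges crossing the cut
    vol = w[mask].sum()  # sum of weighted degrees inside the set
    return float(cut / vol)


def two_triangles(bridge_weight=1.0):
    """Two triangles {0,1,2} and {3,4,5} joined by a single edge between 2 and 3."""
    w = np.zeros((6, 6))
    for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
        w[u, v] = w[v, u] = 1.0
    w[2, 3] = w[3, 2] = bridge_weight
    return w


if __name__ == "__main__":
    cluster = [0, 1, 2]
    print(conductance(two_triangles(1.0), cluster))  # 1/7
    print(conductance(two_triangles(0.5), cluster))  # 1/13
```

Halving the bridge weight takes the cut from 1/7 to 1/13 while leaving the edges inside each triangle untouched, which is the kind of trade-off the reweighting result targets: sparser cuts without damaging connectivity inside clusters.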