🤖 AI Summary
This work addresses the scalability limits of strong LP-based approximation algorithms for large-scale correlation clustering, whose LP relaxations carry Θ(n³) triangle inequality constraints. The study investigates the trade-off between sparsification via edge sampling and approximation quality, identifying pseudometric structure as the property that governs both algorithmic scalability and robustness. A VC-dimension analysis shows that the clustering disagreement class has VC dimension exactly n−1, which yields an additive ε-coreset of optimal size Õ(n/ε²); a bound on the number of triangle inequalities active at any LP vertex enables an exact cutting-plane solver. The authors further propose the first sparse variant of LP-PIVOT, which achieves a 10/3 approximation ratio (with controllable additive error) while observing only Õ(n^{3/2}) edges, and they prove this edge-sample threshold is tight.
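For orientation, LP-PIVOT builds on the classic PIVOT routine of Ailon, Charikar, and Newman: repeatedly pick a random pivot vertex and cluster it with its remaining positive neighbors. Below is a minimal sketch of plain PIVOT (not the paper's sparse LP-based variant); the `is_positive` predicate and the list-based bookkeeping are our illustrative choices, not the authors' implementation.

```python
import random

def pivot(nodes, is_positive, rng=random.Random(0)):
    """Classic PIVOT for correlation clustering: pick a uniformly random
    pivot, group it with its current '+' neighbors, remove them all, and
    recurse on the rest. `is_positive(u, v)` is a hypothetical predicate
    returning True iff edge (u, v) carries a '+' label."""
    remaining = list(nodes)
    clusters = []
    while remaining:
        # Choose a random pivot among the unclustered vertices.
        p = remaining.pop(rng.randrange(len(remaining)))
        cluster = [p] + [v for v in remaining if is_positive(p, v)]
        remaining = [v for v in remaining if not is_positive(p, v)]
        clusters.append(cluster)
    return clusters
```

On a perfectly clusterable instance, e.g. two cliques `{0,1,2}` and `{3,4}` joined by '−' edges, any pivot order recovers exactly those two clusters; the LP-based variants improve the guarantee on noisy instances by rounding LP marginals instead of raw edge signs.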
📝 Abstract
Correlation Clustering (CC) is a fundamental unsupervised learning primitive whose strongest LP-based approximation guarantees require $\Theta(n^3)$ triangle inequality constraints and are prohibitive at scale. We initiate the study of \emph{sparsification--approximation trade-offs} for CC, asking how much edge information is needed to retain LP-based guarantees. We establish a structural dichotomy between pseudometric and general weighted instances. On the positive side, we prove that the VC dimension of the clustering disagreement class is exactly $n{-}1$, yielding additive $\varepsilon$-coresets of optimal size $\tilde{O}(n/\varepsilon^2)$; that at most $\binom{n}{2}$ triangle inequalities are active at any LP vertex, enabling an exact cutting-plane solver; and that a sparsified variant of LP-PIVOT, which imputes missing LP marginals via triangle inequalities, achieves a robust $\frac{10}{3}$-approximation (up to an additive term controlled by an empirically computable imputation-quality statistic $\overline{\Gamma}_w$) once $\tilde{\Theta}(n^{3/2})$ edges are observed, a threshold we prove is sharp. On the negative side, we show via Yao's minimax principle that without pseudometric structure, any algorithm observing $o(n)$ uniformly random edges incurs an unbounded approximation ratio, demonstrating that the pseudometric condition governs not only tractability but also the robustness of CC to incomplete information.
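The cutting-plane solver described in the abstract avoids instantiating all $\Theta(n^3)$ triangle inequalities by solving the LP over a working subset and separating violated constraints on demand. The sketch below shows only the separation step, under our own illustrative conventions (LP distances stored in a dict keyed by sorted vertex pairs); it is not the paper's solver.

```python
from itertools import combinations

def violated_triangles(x, n, tol=1e-9):
    """Separation step for a cutting-plane CC solver: given candidate LP
    distances x[(i, j)] for 0 <= i < j < n, return each violated triangle
    inequality x_ab <= x_ac + x_cb as ((a, b), c). Representation and
    names here are illustrative assumptions."""
    def d(i, j):
        return x[(min(i, j), max(i, j))]
    cuts = []
    for i, j, k in combinations(range(n), 3):
        # Each unordered triple yields three triangle inequalities,
        # one per choice of the "middle" vertex c.
        for a, b, c in ((i, j, k), (i, k, j), (j, k, i)):
            if d(a, b) > d(a, c) + d(c, b) + tol:
                cuts.append(((a, b), c))
    return cuts
```

A solver would add the returned inequalities to the LP and re-solve until no violations remain; the paper's bound that at most $\binom{n}{2}$ triangle inequalities are active at any LP vertex is what keeps this loop tractable.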