🤖 AI Summary
This work addresses the scalability limits of strong LP-based approximation algorithms for large-scale correlation clustering, whose LP relaxations carry Θ(n³) triangle inequality constraints. The study investigates the trade-off between sparsification via edge sampling and approximation quality, identifying pseudometric structure as the property that governs both algorithmic scalability and robustness. A VC-dimension analysis shows that the clustering disagreement class has VC dimension exactly n−1, which yields an additive ε-coreset of optimal size Õ(n/ε²); a bound on the number of triangle inequalities active at any LP vertex enables an exact cutting-plane solver. The authors further propose the first sparse variant of LP-PIVOT, which achieves a 10/3 approximation ratio (with controllable additive error) while observing only Õ(n^{3/2}) edges, and they prove this edge-sample threshold is tight.
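For orientation, LP-PIVOT builds on the classic PIVOT routine of Ailon, Charikar, and Newman: repeatedly pick a random pivot vertex and cluster it with its remaining positive neighbors. Below is a minimal sketch of plain PIVOT (not the paper's sparse LP-based variant); the `is_positive` predicate and the list-based bookkeeping are our illustrative choices, not the authors' implementation.

```python
import random

def pivot(nodes, is_positive, rng=random.Random(0)):
    """Classic PIVOT for correlation clustering: pick a uniformly random
    pivot, group it with its current '+' neighbors, remove them all, and
    recurse on the rest. `is_positive(u, v)` is a hypothetical predicate
    returning True iff edge (u, v) carries a '+' label."""
    remaining = list(nodes)
    clusters = []
    while remaining:
        # Choose a random pivot among the unclustered vertices.
        p = remaining.pop(rng.randrange(len(remaining)))
        cluster = [p] + [v for v in remaining if is_positive(p, v)]
        remaining = [v for v in remaining if not is_positive(p, v)]
        clusters.append(cluster)
    return clusters
```

On a perfectly clusterable instance, e.g. two cliques `{0,1,2}` and `{3,4}` joined by '−' edges, any pivot order recovers exactly those two clusters; the LP-based variants improve the guarantee on noisy instances by rounding LP marginals instead of raw edge signs.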
📝 Abstract
Correlation Clustering (CC) is a fundamental unsupervised learning primitive whose strongest LP-based approximation guarantees require $\Theta(n^3)$ triangle inequality constraints and are prohibitive at scale. We initiate the study of \emph{sparsification--approximation trade-offs} for CC, asking how much edge information is needed to retain LP-based guarantees. We establish a structural dichotomy between pseudometric and general weighted instances. On the positive side, we prove that the VC dimension of the clustering disagreement class is exactly $n{-}1$, yielding additive $\varepsilon$-coresets of optimal size $\tilde{O}(n/\varepsilon^2)$; that at most $\binom{n}{2}$ triangle inequalities are active at any LP vertex, enabling an exact cutting-plane solver; and that a sparsified variant of LP-PIVOT, which imputes missing LP marginals via triangle inequalities, achieves a robust $\frac{10}{3}$-approximation (up to an additive term controlled by an empirically computable imputation-quality statistic $\overline{\Gamma}_w$) once $\tilde{\Theta}(n^{3/2})$ edges are observed, a threshold we prove is sharp. On the negative side, we show via Yao's minimax principle that without pseudometric structure, any algorithm observing $o(n)$ uniformly random edges incurs an unbounded approximation ratio, demonstrating that the pseudometric condition governs not only tractability but also the robustness of CC to incomplete information.
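The cutting-plane solver described in the abstract avoids instantiating all $\Theta(n^3)$ triangle inequalities by solving the LP over a working subset and separating violated constraints on demand. The sketch below shows only the separation step, under our own illustrative conventions (LP distances stored in a dict keyed by sorted vertex pairs); it is not the paper's solver.

```python
from itertools import combinations

def violated_triangles(x, n, tol=1e-9):
    """Separation step for a cutting-plane CC solver: given candidate LP
    distances x[(i, j)] for 0 <= i < j < n, return each violated triangle
    inequality x_ab <= x_ac + x_cb as ((a, b), c). Representation and
    names here are illustrative assumptions."""
    def d(i, j):
        return x[(min(i, j), max(i, j))]
    cuts = []
    for i, j, k in combinations(range(n), 3):
        # Each unordered triple yields three triangle inequalities,
        # one per choice of the "middle" vertex c.
        for a, b, c in ((i, j, k), (i, k, j), (j, k, i)):
            if d(a, b) > d(a, c) + d(c, b) + tol:
                cuts.append(((a, b), c))
    return cuts
```

A solver would add the returned inequalities to the LP and re-solve until no violations remain; the paper's bound that at most $\binom{n}{2}$ triangle inequalities are active at any LP vertex is what keeps this loop tractable.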