🤖 AI Summary
This paper studies the approximation hardness and efficient algorithm design for $\ell_2^2$ min-sum $k$-clustering—minimizing the sum of squared Euclidean distances within each cluster. First, it establishes the first NP-hardness-of-approximation lower bounds: $1.056$ for general instances and $1.327$ for high-dimensional sparse instances. Second, it proposes the first near-linear-time parameterized PTAS, running in $O(n^{1+o(1)} d \cdot \exp((k/\varepsilon)^{O(1)}))$ time. Third, it introduces a novel learning-augmented framework that achieves a $\frac{1+\gamma\alpha}{(1-\alpha)^2}$-approximation guarantee under prediction error $\alpha$, thereby surpassing classical approximation limits. The analysis integrates techniques from Johnson coverage-based hardness reductions, combinatorial optimization, and computational geometry to rigorously characterize the intrinsic complexity of the problem and establish a new paradigm for prior-informed clustering.
📝 Abstract
The $\ell_2^2$ min-sum $k$-clustering problem is to partition an input set into clusters $C_1,\ldots,C_k$ to minimize $\sum_{i=1}^k\sum_{p,q\in C_i}\|p-q\|_2^2$. Although $\ell_2^2$ min-sum $k$-clustering is NP-hard, it is not known whether it is NP-hard to approximate the objective beyond a certain factor. In this paper, we give the first hardness-of-approximation result for the $\ell_2^2$ min-sum $k$-clustering problem. We show that it is NP-hard to approximate the objective to a factor better than $1.056$; moreover, assuming a balanced variant of the Johnson Coverage Hypothesis, it is NP-hard to approximate the objective to a factor better than $1.327$. We then complement our hardness results by giving a nearly linear-time parameterized PTAS for $\ell_2^2$ min-sum $k$-clustering running in time $O\left(n^{1+o(1)}d\cdot \exp((k\cdot\varepsilon^{-1})^{O(1)})\right)$, where $d$ is the underlying dimension of the input dataset. Finally, we consider a learning-augmented setting, where the algorithm has access to an oracle that outputs a label $i\in[k]$ for each input point, thereby implicitly partitioning the input dataset into $k$ clusters that induce an approximately optimal solution, up to some amount of adversarial error $\alpha\in\left[0,\frac{1}{2}\right)$. We give a polynomial-time algorithm that outputs a $\frac{1+\gamma\alpha}{(1-\alpha)^2}$-approximation to $\ell_2^2$ min-sum $k$-clustering, for a fixed constant $\gamma>0$.