🤖 AI Summary
This work addresses the problem of k-means clustering in the presence of outliers by proposing a robust method based on k-nearest neighbor (KNN) distances. Given a budget to remove z outliers, the approach simply discards the z points with the largest KNN distances and then applies standard k-means clustering to the remaining data. Under a mild assumption on the minimum optimal cluster size, the method is the first to theoretically guarantee a constant-factor approximation—comparable to existing algorithms—without requiring additional cluster centers or excessive point removal, thereby establishing a formal connection between outlier detection and robust clustering. Empirical evaluations on multiple real-world datasets demonstrate that the proposed method is both efficient and practical, achieving clustering costs and running times that match or outperform several more complex baseline algorithms.
📝 Abstract
Being robust to the presence of outliers is crucial for applying clustering algorithms in practice. In the $\textit{robust $k$-Means}$ problem (i.e., $k$-Means with outliers), the goal is to remove $z$ outliers and minimize the $k$-Means cost on the remaining points. Despite the close connection between robust $k$-Means and outlier detection, both theoretical and empirical understanding of the effectiveness of $\textit{classic outlier detection heuristics}$ for robust $k$-Means remains limited. In this paper, we prove that under a practical assumption on the optimal cluster sizes, simply removing points with large $K$-Nearest-Neighbor distances achieves performance comparable to prior work in terms of approximation guarantees: it yields a constant-factor reduction from robust $k$-Means to standard $k$-Means, without introducing additional centers or discarding extra outliers, as is commonly required by existing approaches. Empirically, experiments on real-world datasets show that our method outperforms or matches several more sophisticated algorithms in terms of clustering cost and runtime. These results demonstrate that simple KNN-based heuristics can be surprisingly effective for robust clustering, highlighting new opportunities to bridge techniques from outlier detection and clustering.