Simple KNN-Based Outlier Detection Achieves Robust Clustering

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses k-means clustering in the presence of outliers with a robust method based on k-nearest-neighbor (KNN) distances. Given a budget of z outliers to remove, the approach discards the z points with the largest KNN distances and then applies standard k-means clustering to the remaining data. Under a mild assumption on the minimum optimal cluster size, the authors prove that this yields a constant-factor reduction from robust k-means to standard k-means, with approximation guarantees comparable to existing algorithms and without requiring additional cluster centers or the removal of extra points. This establishes a formal connection between outlier detection and robust clustering. Experiments on multiple real-world datasets show that the method is efficient and practical, matching or outperforming several more complex baseline algorithms in clustering cost and running time.
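The two-step procedure summarized above (score points by KNN distance, drop the z highest-scoring points, then run ordinary k-means) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the brute-force neighbor search, the choice of `K`, the tie-breaking, and the use of Lloyd's algorithm as the "standard k-means" solver are all assumptions made here. Note that `K` (number of neighbors) is a separate parameter from `k` (number of clusters).

```python
import math
import random


def knn_distance(points, K):
    """Distance from each point to its K-th nearest neighbor (brute force, O(n^2))."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[K - 1])
    return scores


def remove_outliers(points, z, K):
    """Drop the z points with the largest K-NN distance; return (kept, dropped indices)."""
    scores = knn_distance(points, K)
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    dropped = set(order[:z])
    kept = [p for i, p in enumerate(points) if i not in dropped]
    return kept, sorted(dropped)


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm, standing in for any standard k-means solver."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep the old center if empty.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its closest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


# Toy data (hypothetical): two tight clusters plus one far-away point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
        (50.0, -40.0)]
kept, dropped = remove_outliers(data, z=1, K=2)
centers = kmeans(kept, k=2)
print(dropped, kmeans_cost(kept, centers))
```

On larger datasets the brute-force neighbor search would normally be replaced by a spatial index or an approximate nearest-neighbor method; the sketch keeps it quadratic only for readability.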

📝 Abstract

Being robust to the presence of outliers is crucial for applying clustering algorithms in practice. In the \textit{robust $k$-Means} problem (i.e., $k$-Means with outliers), the goal is to remove $z$ outliers and minimize the $k$-Means cost on the remaining points. Despite the close connection between robust $k$-Means and outlier detection, both theoretical and empirical understanding of the effectiveness of \textit{classic outlier detection heuristics} for robust $k$-Means remains limited. In this paper, we prove that under a practical assumption on the optimal cluster sizes, simply removing points with large $K$-Nearest-Neighbor distances achieves performance comparable to prior work in terms of approximation guarantees: it yields a constant-factor reduction from robust $k$-Means to standard $k$-Means, without introducing additional centers or discarding extra outliers, as is commonly required by existing approaches. Empirically, experiments on real-world datasets show that our method outperforms or matches several more sophisticated algorithms in terms of clustering cost and runtime. These results demonstrate that simple KNN-based heuristics can be surprisingly effective for robust clustering, highlighting new opportunities to bridge techniques from outlier detection and clustering.
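For reference, the objective described in the abstract is commonly written as below. The notation is an assumption made here rather than taken from the paper: $X$ is the input point set, $Z$ the set of removed points, $C$ the set of centers, and $d_K(x)$ the $K$-Nearest-Neighbor distance used as the outlier score.

$$
\min_{\substack{C,\; |C| = k \\ Z \subseteq X,\; |Z| \le z}} \;\; \sum_{x \in X \setminus Z} \; \min_{c \in C} \lVert x - c \rVert^2,
\qquad
d_K(x) = \text{distance from } x \text{ to its } K\text{-th nearest neighbor in } X \setminus \{x\}.
$$

Under this notation, the heuristic fixes $Z$ to be the $z$ points with the largest $d_K(x)$ and then solves the remaining standard $k$-Means problem on $X \setminus Z$.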

Problem

Research questions and friction points this paper is trying to address.

robust k-Means

outlier detection

clustering

KNN-based heuristic

Innovation

Methods, ideas, or system contributions that make the work stand out.

robust k-Means

KNN-based outlier detection

clustering with outliers

approximation guarantee