🤖 AI Summary
This paper addresses robust clustering under sub-Gaussian mixture models in the presence of adversarial outliers, aiming for the statistically optimal misclustering rate. The authors propose a lightweight iterative algorithm based on coordinate-wise median estimation: starting from a robust initialization, it converges within a constant number of iterations of coordinate-wise median updates, while preserving the optimal statistical rate even under adversarial contamination. Theoretically, its misclustering rate attains the information-theoretic lower bound. Empirically, the method significantly outperforms existing robust clustering algorithms on both synthetic and real-world datasets, and matches the performance of Lloyd's algorithm in outlier-free settings. The key innovation is the introduction of coordinate-wise median estimation into mixture model clustering, achieving, for the first time, a unified guarantee of robustness against adversarial outliers and statistical optimality.
📝 Abstract
We consider the problem of clustering data points coming from sub-Gaussian mixtures. Existing methods that provably achieve the optimal mislabeling error, such as the Lloyd algorithm, are usually vulnerable to outliers. In contrast, clustering methods that appear robust to adversarial perturbations are not known to satisfy the optimal statistical guarantees. We propose a simple robust algorithm based on the coordinatewise median that obtains the optimal mislabeling rate even when adversarial outliers are allowed to be present. Our algorithm achieves the optimal error rate in a constant number of iterations when a weak initialization condition is satisfied. In the absence of outliers, in fixed dimensions, our theoretical guarantees are similar to those of the Lloyd algorithm. We conduct extensive experiments on various simulated and public datasets to support the theoretical guarantees of our method.
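The core idea described above, a Lloyd-style alternation where the centroid update uses the coordinatewise median instead of the mean, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact procedure: the function name, the fixed iteration count, and the user-supplied initialization are assumptions, and the paper's robust initialization step is not reproduced here.

```python
import numpy as np

def median_lloyd(X, init_centers, n_iter=10):
    """Lloyd-style clustering sketch: the assignment step is the usual
    nearest-center rule, but each centroid update takes the coordinatewise
    median of its cluster, which resists adversarial outliers coordinate
    by coordinate. `init_centers` plays the role of a robust initialization."""
    centers = np.asarray(init_centers, dtype=float).copy()
    k = len(centers)
    for _ in range(n_iter):
        # Assignment step: label each point by its nearest current center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: coordinatewise median of each non-empty cluster.
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = np.median(pts, axis=0)
    return labels, centers
```

In a quick synthetic check, a single extreme outlier assigned to a cluster shifts the mean arbitrarily far but barely moves the coordinatewise median, so the recovered centers stay near the true mixture components.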