🤖 AI Summary
This work identifies the fundamental failure mechanism of Lloyd's k-means algorithm in high-dimensional, high-noise, small-sample regimes. Theoretically, we show that as dimensionality increases, the signal-to-noise ratio (SNR) decreases, or the sample size shrinks, the k-means objective develops an exponential number of spurious fixed points, causing nearly all initializations to converge to meaningless clusterings. This explains pervasive clustering failures in practical applications such as cryo-electron microscopy (cryo-EM). Working within a Gaussian mixture model, we provide the first rigorous characterization of fixed-point proliferation in high-dimensional, small-sample settings. Using probabilistic concentration arguments and a structural analysis of fixed points, we derive precise critical phase-transition conditions governing the interplay among dimensionality, SNR, and sample size that triggers algorithmic breakdown. Our analysis establishes a novel theoretical framework for understanding the degradation of expectation-maximization-type algorithms in high-dimensional, low-SNR regimes.
📝 Abstract
Clustering and estimating cluster means are core problems in statistics and machine learning, with k-means and Expectation-Maximization (EM) being two widely used algorithms. In this work, we provide a theoretical explanation for the failure of k-means in high-dimensional settings with high noise and limited sample sizes, using a simple Gaussian Mixture Model (GMM). We identify regimes where, with high probability, almost every partition of the data becomes a fixed point of the k-means algorithm. This study is motivated by challenges in the analysis of more complex cases, such as masked GMMs and those arising from applications in cryo-electron microscopy.
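The fixed-point phenomenon described above is easy to observe numerically. The sketch below (an illustrative simulation, not the paper's construction; the sample size, dimension, and SNR values are arbitrary choices) draws a small sample from a weakly separated two-component GMM in high dimension and checks whether a random, meaningless partition is a fixed point of Lloyd's algorithm, i.e., whether recomputing centroids and reassigning points leaves every label unchanged. When the dimension greatly exceeds the sample size and the signal is weak, each point tends to remain closest to the centroid of its own cluster simply because it contributes to that centroid, so random partitions are typically fixed points.

```python
import numpy as np

def is_lloyd_fixed_point(X, labels):
    """Check whether a partition is a fixed point of Lloyd's k-means:
    recompute each cluster's centroid, reassign every point to its
    nearest centroid, and test that no label changes."""
    k = labels.max() + 1
    centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    # Squared Euclidean distance from every point to every centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.array_equal(dists.argmin(axis=1), labels)

rng = np.random.default_rng(0)
n, d, snr = 20, 5000, 0.1  # few samples, high dimension, weak signal (assumed values)

# Two GMM components with weak separation relative to unit noise.
true_means = snr * rng.standard_normal((2, d))
true_labels = rng.integers(0, 2, size=n)
X = true_means[true_labels] + rng.standard_normal((n, d))

# A random balanced partition that ignores the true labels entirely.
random_labels = rng.permutation(np.arange(n) % 2)
print(is_lloyd_fixed_point(X, random_labels))
```

In this regime the check typically prints `True`: Lloyd's algorithm started from this random partition would terminate immediately, having recovered nothing about the true components. Increasing `snr` or `n` relative to `d` eventually makes such spurious fixed points disappear, which is the phase-transition behavior the abstract refers to.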