🤖 AI Summary
This paper investigates the asymptotic consistency of $k$-means clustering under a weak moment condition: the underlying distribution is assumed to have only a finite first moment, so its variance may be infinite. We observe that the population $k$-means problem remains well-posed in this setting, and we show through a series of negative results that the empirical cluster centers may fail to converge even when the population $k$-means centers are unique. Our analysis traces this inconsistency to an extreme form of cluster imbalance, in which outlying samples leave some empirical clusters with very few points. To address this, we impose an a priori degree of balance among the empirical clusters and give positive results showing that some forms of asymptotic consistency are then recovered under finite expectation alone. These results provide a theoretical basis for the stability of $k$-means clustering on heavy-tailed data.
📝 Abstract
A celebrated result of Pollard proves asymptotic consistency for $k$-means clustering when the population distribution has finite variance. In this work, we point out that the population-level $k$-means clustering problem is, in fact, well-posed under the weaker assumption of a finite expectation, and we investigate whether some form of asymptotic consistency holds in this setting. As we illustrate in a variety of negative results, the complete story is quite subtle; for example, the empirical $k$-means cluster centers may fail to converge even if there exists a unique set of population $k$-means cluster centers. A detailed analysis of our negative results reveals that inconsistency arises because of an extreme form of cluster imbalance, whereby the presence of outlying samples leads to some empirical $k$-means clusters possessing very few points. We then give a collection of positive results which show that some forms of asymptotic consistency, under only the assumption of finite expectation, may be recovered by imposing some a priori degree of balance among the empirical $k$-means clusters.
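The imbalance described above can be explored numerically. The following is a minimal illustrative sketch, not the paper's construction: it solves 2-means exactly in one dimension by scanning every contiguous split of the sorted sample, then reports the share of points in the smaller cluster. The function names `two_means_1d` and `pareto_sample`, the choice of a Pareto distribution, and the shape parameter 1.5 (finite mean, infinite variance) are all assumptions made for this example.

```python
import random


def two_means_1d(xs):
    """Exact 2-means in one dimension.

    Optimal clusters in 1D are contiguous intervals of the sorted data,
    so scanning every split point finds the minimiser of the
    within-cluster sum of squares. Returns (centers, sizes).
    """
    xs = sorted(xs)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")

    # Prefix sums of x and x^2 give the cost of any interval in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(i, j):
        # Sum of squared deviations from the mean over xs[i:j].
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    split = min(range(1, n), key=lambda i: cost(0, i) + cost(i, n))
    centers = (s1[split] / split, (s1[n] - s1[split]) / (n - split))
    return centers, (split, n - split)


def pareto_sample(n, alpha, rng):
    # Hypothetical heavy-tailed test data: Pareto(alpha) on [1, inf) has a
    # finite mean for alpha > 1 and infinite variance for alpha <= 2.
    return [rng.paretovariate(alpha) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 1000, 10000):
        _, sizes = two_means_1d(pareto_sample(n, alpha=1.5, rng=rng))
        # Share of the sample that falls in the smaller cluster.
        print(n, min(sizes) / n)
```

Running the script for several seeds and sample sizes gives a rough feel for how small the smaller cluster can become on heavy-tailed draws. It does not by itself demonstrate any of the paper's results, and the prefix-sum cost can lose floating-point precision when a sample contains very large values.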