🤖 AI Summary
This work addresses the global convergence of gradient EM for over-parameterized Gaussian mixture models (GMMs), where a GMM with $n > 1$ components is fitted to data generated from a single Gaussian. Existing analyses cover only the two-component case, and the general setting raises new barriers: convergence becomes sublinear and non-monotonic, and no global guarantees were previously known.
Method: We construct a novel likelihood-based convergence analysis framework that explicitly accounts for the geometry of the over-parameterized regime.
Contribution/Results: We establish the first rigorous global convergence guarantee for gradient EM in the over-parameterized GMM setting beyond two components, with a sublinear convergence rate of $O(1/\sqrt{t})$. Furthermore, we identify a class of “bad local regions” in parameter space that can trap gradient EM for an exponentially large number of steps. Together, these results close a long-standing gap by providing a global convergence theory, with an explicit rate and an identification of fundamental optimization barriers, for over-parameterized GMMs beyond the $n = 2$ case.
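To make the analyzed iteration concrete, the following is a minimal sketch of one gradient EM step under a common simplification (identity covariances, uniform mixing weights $1/n$, and a single ground-truth Gaussian $\mathcal{N}(\mu^\star, I)$); the symbols $w_i$, $\mu_i$, $\mu^\star$, and the step size $\eta$ are illustrative notation not taken from the paper, whose exact parameterization may differ:

$$
w_i(x;\mu^t) = \frac{\exp\!\left(-\tfrac{1}{2}\|x-\mu_i^t\|^2\right)}{\sum_{k=1}^{n}\exp\!\left(-\tfrac{1}{2}\|x-\mu_k^t\|^2\right)}, \qquad
\mu_i^{t+1} = \mu_i^t + \eta\,\mathbb{E}_{x\sim\mathcal{N}(\mu^\star, I)}\!\left[ w_i(x;\mu^t)\,\big(x-\mu_i^t\big) \right].
$$

The E-step computes the responsibilities $w_i$, and the M-step is replaced by a single gradient ascent step on the EM surrogate; classical EM would instead set $\mu_i^{t+1}$ to the responsibility-weighted sample mean.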
📝 Abstract
We study the gradient Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMM) in the over-parameterized setting, where a general GMM with $n>1$ components learns from data that are generated by a single ground truth Gaussian distribution. While results for the special case of 2-Gaussian mixtures are well-known, a general global convergence analysis for arbitrary $n$ remains unresolved and faces several new technical barriers since the convergence becomes sub-linear and non-monotonic. To address these challenges, we construct a novel likelihood-based convergence analysis framework and rigorously prove that gradient EM converges globally with a sublinear rate $O(1/\sqrt{t})$. This is the first global convergence result for Gaussian mixtures with more than $2$ components. The sublinear convergence rate is due to the algorithmic nature of learning over-parameterized GMM with gradient EM. We also identify a new emerging technical challenge for learning general over-parameterized GMM: the existence of bad local regions that can trap gradient EM for an exponential number of steps.
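As a rough illustration of this setting (not the authors' code), the sketch below runs gradient EM on samples from a single standard Gaussian with $n = 4$ components, fixed identity covariances, and uniform mixing weights, and prints the sample log-likelihood as it improves; all names, hyperparameters, and simplifications here are illustrative assumptions.

```python
# Illustrative sketch: gradient EM for an over-parameterized GMM (n components,
# identity covariances, uniform weights) fitted to data from a single Gaussian.
import numpy as np

rng = np.random.default_rng(0)
d, n, N = 2, 4, 5000                   # dimension, number of components, sample size
X = rng.standard_normal((N, d))        # data from the single ground-truth Gaussian N(0, I)
mu = rng.standard_normal((n, d))       # initial component means
eta = 1.0                              # step size for the gradient EM update

def log_likelihood(X, mu):
    # average log density of the uniform-weight mixture (1/n) * sum_i N(x; mu_i, I)
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)          # (N, n)
    logp = -0.5 * sq - 0.5 * X.shape[1] * np.log(2 * np.pi)       # (N, n)
    m = logp.max(axis=1, keepdims=True)                           # log-sum-exp trick
    return np.mean(m.squeeze(1) + np.log(np.exp(logp - m).mean(axis=1)))

for t in range(200):
    # E-step: posterior responsibilities w[j, i] of component i for sample j
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    w = np.exp(-0.5 * (sq - sq.min(axis=1, keepdims=True)))       # shifted for stability
    w /= w.sum(axis=1, keepdims=True)
    # Gradient step on the EM surrogate: mu_i += eta * (1/N) * sum_j w[j, i] * (x_j - mu_i)
    grad = (w[:, :, None] * (X[:, None, :] - mu[None, :, :])).mean(axis=0)
    mu = mu + eta * grad
    if t % 50 == 0:
        print(f"t={t:3d}  log-likelihood={log_likelihood(X, mu):.6f}")
```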