🤖 AI Summary
Problem: Python lacks a Gaussian Mixture Model (GMM) library that fully automates hyperparameter selection, namely the number of components and the covariance structure; the leading existing tool, R's mclust, relies on the Bayesian Information Criterion (BIC). Model selection for GMMs is NP-hard and remains challenging in practice.
Method: We propose the first fully automated, end-to-end GMM modeling framework for Python. It addresses the NP-hard model selection problem via a stabilized initialization strategy, hierarchical component growth, integrated BIC/AIC scoring, covariance constraint optimization, and hierarchical clustering–guided structure search.
Contribution/Results: Our method automatically determines optimal component count and covariance type across multiple benchmark datasets, matching mclust’s clustering accuracy and outperforming it significantly on several tasks. It naturally extends to Hierarchical GMM (HGMM), supporting clustering, discriminant analysis, and density estimation. The implementation is scikit-learn–compatible and open-source.
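To make the selection problem concrete, here is a minimal sketch (not AutoGMM's actual algorithm, which adds stabilized initialization and hierarchical structure search) of choosing the component count and covariance constraint by a BIC sweep over scikit-learn's `GaussianMixture`:

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# Illustrative sketch: exhaustively score candidate models by BIC.
X = load_iris().data

best = None
for n in range(1, 11):
    for cov in ("full", "tied", "diag", "spherical"):
        gmm = GaussianMixture(n_components=n, covariance_type=cov,
                              n_init=3, random_state=0).fit(X)
        bic = gmm.bic(X)  # lower BIC = better penalized fit
        if best is None or bic < best[0]:
            best = (bic, n, cov, gmm)

bic, n, cov, model = best
print(f"selected n_components={n}, covariance_type={cov}, BIC={bic:.1f}")
```

A brute-force sweep like this scales poorly; the summary's points about hierarchical component growth and clustering-guided structure search are precisely about pruning this search space.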
📝 Abstract
Background: Gaussian mixture modeling is a fundamental tool in clustering, as well as discriminant analysis and semiparametric density estimation. However, estimating the optimal model for any given number of components is an NP-hard problem, and estimating the number of components is in some respects an even harder problem. Findings: In R, a popular package called mclust addresses both of these problems. However, Python has lacked such a package. We therefore introduce AutoGMM, a Python algorithm for automatic Gaussian mixture modeling, and its hierarchical version, HGMM. AutoGMM builds upon scikit-learn's AgglomerativeClustering and GaussianMixture classes, with certain modifications to make the results more stable. Empirically, on several different applications, AutoGMM performs approximately as well as mclust, and sometimes better. Conclusions: AutoGMM, a freely available Python package, enables efficient Gaussian mixture modeling by automatically selecting the initialization, number of clusters, and covariance constraints.
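The abstract's combination of the two scikit-learn classes can be sketched as follows. This is an assumed, simplified illustration (the class names are real scikit-learn APIs, but AutoGMM's stabilizing modifications are not reproduced here): an agglomerative clustering supplies initial component means for the EM fit.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
k = 3  # number of components, assumed fixed for this sketch

# Use hierarchical clustering labels to seed the mixture's means,
# making the subsequent EM fit less sensitive to random restarts.
labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
means_init = np.stack([X[labels == c].mean(axis=0) for c in range(k)])

gmm = GaussianMixture(n_components=k, covariance_type="full",
                      means_init=means_init, random_state=0).fit(X)
print("converged:", gmm.converged_)
```

Seeding EM from a deterministic hierarchical clustering is one common way to stabilize mixture fits, which matches the abstract's stated motivation for building on AgglomerativeClustering.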