🤖 AI Summary
Problem: Python lacks a Gaussian Mixture Model (GMM) library that fully automates hyperparameter selection, namely the number of components and the covariance structure; the leading existing tool, R's mclust, relies on the Bayesian Information Criterion (BIC). Model selection for GMMs is NP-hard and remains challenging in practice.
Method: We propose the first fully automated, end-to-end GMM modeling framework for Python. It addresses the NP-hard model selection problem via a stabilized initialization strategy, hierarchical component growth, integrated BIC/AIC scoring, covariance constraint optimization, and hierarchical clustering–guided structure search.
Contribution/Results: Our method automatically determines optimal component count and covariance type across multiple benchmark datasets, matching mclust’s clustering accuracy and outperforming it significantly on several tasks. It naturally extends to Hierarchical GMM (HGMM), supporting clustering, discriminant analysis, and density estimation. The implementation is scikit-learn–compatible and open-source.
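To make the selection problem concrete, here is a minimal sketch (not AutoGMM's actual algorithm, which adds stabilized initialization and hierarchical structure search) of choosing the component count and covariance constraint by a BIC sweep over scikit-learn's `GaussianMixture`:

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

# Illustrative sketch: exhaustively score candidate models by BIC.
X = load_iris().data

best = None
for n in range(1, 11):
    for cov in ("full", "tied", "diag", "spherical"):
        gmm = GaussianMixture(n_components=n, covariance_type=cov,
                              n_init=3, random_state=0).fit(X)
        bic = gmm.bic(X)  # lower BIC = better penalized fit
        if best is None or bic < best[0]:
            best = (bic, n, cov, gmm)

bic, n, cov, model = best
print(f"selected n_components={n}, covariance_type={cov}, BIC={bic:.1f}")
```

A brute-force sweep like this scales poorly; the summary's points about hierarchical component growth and clustering-guided structure search are precisely about pruning this search space.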
📝 Abstract
Background: Gaussian mixture modeling is a fundamental tool in clustering, as well as discriminant analysis and semiparametric density estimation. However, estimating the optimal model for any given number of components is an NP-hard problem, and estimating the number of components is in some respects an even harder problem. Findings: In R, a popular package called mclust addresses both of these problems. However, Python has lacked such a package. We therefore introduce AutoGMM, a Python algorithm for automatic Gaussian mixture modeling, and its hierarchical version, HGMM. AutoGMM builds upon scikit-learn's AgglomerativeClustering and GaussianMixture classes, with certain modifications to make the results more stable. Empirically, on several different applications, AutoGMM performs approximately as well as mclust, and sometimes better. Conclusions: AutoGMM, a freely available Python package, enables efficient Gaussian mixture modeling by automatically selecting the initialization, number of clusters, and covariance constraints.
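The abstract's combination of the two scikit-learn classes can be sketched as follows. This is an assumed, simplified illustration (the class names are real scikit-learn APIs, but AutoGMM's stabilizing modifications are not reproduced here): an agglomerative clustering supplies initial component means for the EM fit.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
k = 3  # number of components, assumed fixed for this sketch

# Use hierarchical clustering labels to seed the mixture's means,
# making the subsequent EM fit less sensitive to random restarts.
labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
means_init = np.stack([X[labels == c].mean(axis=0) for c in range(k)])

gmm = GaussianMixture(n_components=k, covariance_type="full",
                      means_init=means_init, random_state=0).fit(X)
print("converged:", gmm.converged_)
```

Seeding EM from a deterministic hierarchical clustering is one common way to stabilize mixture fits, which matches the abstract's stated motivation for building on AgglomerativeClustering.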