🤖 AI Summary
Problem: Clustering high-dimensional, skewed data is hampered by parameter redundancy, limited structural interpretability, and the inability of conventional mixture models to capture hierarchical dependencies among variables.
Method: This paper proposes the ultrametric Manly mixture model family, which is the first to embed an ultrametric covariance structure into the Manly mixture framework. It uses the Manly transformation to model asymmetry and an ultrametric decomposition of the covariance matrix to compress parameters (by up to 60% relative to standard Manly mixtures) while explicitly encoding latent hierarchical relationships among variables. A two-step model selection strategy driven by information criteria (BIC/ICL) is also introduced to make the choice among candidate mixture models more reliable.
Results: Experiments on synthetic and real-world datasets demonstrate substantial improvements in clustering accuracy and model selection consistency. The method also enables interpretable visualization of within-cluster hierarchical structures.
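As a point of reference for the Manly transformation named in the summary, the following is a minimal sketch of the standard Manly exponential transformation, M(y; λ) = (exp(λy) − 1)/λ for λ ≠ 0 and M(y; 0) = y, applied with one skewness parameter per variable. The function names and the sample values are illustrative assumptions, not taken from the paper.

```python
import math

def manly_transform(y, lam):
    """Standard Manly exponential transformation of a single value.

    lam == 0 leaves the value unchanged (no skewness adjustment).
    """
    if lam == 0.0:
        return y
    # expm1 computes exp(x) - 1 accurately when lam * y is close to zero
    return math.expm1(lam * y) / lam

def manly_transform_row(row, lams):
    """Apply the transformation coordinate-wise, one lambda per variable."""
    return [manly_transform(y, lam) for y, lam in zip(row, lams)]

# Hypothetical observation with three variables and per-variable lambdas
row = [0.5, -1.2, 2.0]
lams = [0.0, 0.3, -0.4]
transformed = manly_transform_row(row, lams)
```

In a Manly mixture, the transformed observations within each cluster are modeled as Gaussian, so each λ controls how far that variable departs from symmetry.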
📝 Abstract
A family of parsimonious ultrametric mixture models with the Manly transformation is developed for clustering high-dimensional and asymmetric data. Recent advances in Gaussian mixture modeling handle high-dimensional data well but struggle with skewness, which is common in practice. While these advances reduce the number of free parameters, they often provide limited insight into the structure and interpretation of the clusters. To address this shortcoming, this work combines the extended ultrametric covariance structure with the Manly transformation, resulting in the parsimonious ultrametric Manly mixture model family. The ultrametric covariance structure reduces the number of free parameters while identifying latent hierarchical relationships between and within groups of variables. This structure allows the hierarchical relationships within individual clusters to be visualized, improving cluster interpretability. Additionally, as with many classes of mixture models, model selection remains a fundamental challenge; a two-step model selection procedure is proposed herein. Simulation studies and real data analyses demonstrate improved model selection with the proposed two-step method and effective clustering performance for the proposed family.
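To give a concrete sense of what an ultrametric covariance structure looks like, the sketch below checks the usual definition of an ultrametric matrix (symmetric, nonnegative, each diagonal entry at least as large as the off-diagonal entries in its row, and m[i][j] >= min(m[i][k], m[k][j]) for all i, j, k) on a toy 4×4 matrix with two groups of variables. The definition is the standard one from the matrix literature; the paper's extended structure may relax or add conditions, and the matrix values here are made up for illustration.

```python
from itertools import product

def is_ultrametric(m, tol=1e-12):
    """Check the standard ultrametric matrix conditions on a square matrix."""
    p = len(m)
    # Symmetry and nonnegativity
    for i, j in product(range(p), repeat=2):
        if abs(m[i][j] - m[j][i]) > tol or m[i][j] < -tol:
            return False
    # Each diagonal entry dominates the off-diagonal entries in its row
    for i in range(p):
        if any(m[i][i] < m[i][k] - tol for k in range(p) if k != i):
            return False
    # Three-point condition: m[i][j] >= min(m[i][k], m[k][j])
    for i, j, k in product(range(p), repeat=3):
        if m[i][j] < min(m[i][k], m[k][j]) - tol:
            return False
    return True

# Toy covariance: variables {0, 1} and {2, 3} form two groups.
# Within-group covariances are 0.6 and 0.5; the between-group covariance is 0.2.
sigma = [
    [1.0, 0.6, 0.2, 0.2],
    [0.6, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.5],
    [0.2, 0.2, 0.5, 1.0],
]

# Raising a single between-group covariance breaks the hierarchy.
sigma_bad = [row[:] for row in sigma]
sigma_bad[0][2] = sigma_bad[2][0] = 0.4

sigma_ok = is_ultrametric(sigma)
sigma_bad_ok = is_ultrametric(sigma_bad)
```

The toy matrix is described by four distinct values, compared with ten free entries in an unrestricted symmetric 4×4 matrix, which illustrates where the parameter savings come from. The nested pattern (variables merge within groups first, then the groups merge at a lower covariance) is what can be drawn as a tree over the variables.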