🤖 AI Summary
Traditional model-based clustering has limited expressiveness because it assumes that all components share the same distributional form. To address this, the paper considers the Copula-Based Mixture Model (CBMM), which allows heterogeneous marginal distributions across components and flexible copula-based dependence structures. The authors adapt the Generalized Iterative Conditional Estimation (GICE) algorithm, originally developed for identifying switching Markov models, to estimate the marginal distribution types, copula families, and their parameters in an unsupervised manner. Tests on synthetic two-cluster data (2,000 samples) examine the factors that affect convergence. CBMM-GICE is then compared with EM-identified mixture models with a single component form on the full MNIST database (70,000 samples) and on real cardiac magnetic resonance data (276 cases) to illustrate its value for imaging applications. The work relaxes the homogeneity assumption of conventional mixture models and offers a framework for unsupervised subgroup discovery in heterogeneously distributed data.
📝 Abstract
Model-based clustering techniques have been applied in many areas, but most studies focus on canonical mixtures in which every component shares a single distributional form. This strict assumption is often hard to satisfy in practice. In this paper, we consider the more flexible Copula-Based Mixture Models (CBMMs) for clustering, which allow heterogeneous component distributions built from flexible choices of marginal and copula forms. More specifically, we propose an adaptation of the Generalized Iterative Conditional Estimation (GICE) algorithm to identify CBMMs in an unsupervised manner, where the marginal and copula forms and their parameters are estimated iteratively. GICE is adapted from its original version, which was developed for switching Markov model identification with the choice of realization time. Our CBMM-GICE clustering method is then tested on synthetic two-cluster data (N=2000 samples), with a discussion of the factors affecting its convergence. Finally, it is compared with mixture models with a unique component form identified by Expectation Maximization, on the entire MNIST database (N=70000) and on real cardiac magnetic resonance data (N=276), to illustrate its value for imaging applications.
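
To make the idea of heterogeneous components concrete, the following is a minimal Python sketch of how a copula-based mixture density could be evaluated and turned into cluster posteriors. It is not the CBMM-GICE procedure, which also selects the marginal and copula forms iteratively. Everything here is an illustrative assumption: the two-dimensional setting, the Gaussian copula, the particular gamma and normal marginals, the parameter values, and the helper names `gaussian_copula_logpdf`, `component_logpdf`, and `responsibilities`.

```python
import numpy as np
from scipy import stats


def gaussian_copula_logpdf(u, rho):
    """Log-density of a bivariate Gaussian copula at points u in (0, 1)^2."""
    z = stats.norm.ppf(u)                     # uniforms -> normal scores
    z1, z2 = z[:, 0], z[:, 1]
    det = 1.0 - rho ** 2
    quad = (rho ** 2 * (z1 ** 2 + z2 ** 2) - 2.0 * rho * z1 * z2) / (2.0 * det)
    return -0.5 * np.log(det) - quad


def component_logpdf(x, marginals, rho):
    """log f(x) = log c(F1(x1), F2(x2)) + log f1(x1) + log f2(x2)."""
    u = np.column_stack([m.cdf(x[:, j]) for j, m in enumerate(marginals)])
    u = np.clip(u, 1e-12, 1.0 - 1e-12)        # keep norm.ppf finite
    log_marg = sum(m.logpdf(x[:, j]) for j, m in enumerate(marginals))
    return gaussian_copula_logpdf(u, rho) + log_marg


def responsibilities(x, weights, components):
    """Posterior probability of each component for each row of x."""
    log_joint = np.column_stack([
        np.log(w) + component_logpdf(x, marg, rho)
        for w, (marg, rho) in zip(weights, components)
    ])
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


# Hypothetical two-component model with different marginal forms per component.
components = [
    ([stats.gamma(a=3.0, scale=1.0), stats.norm(loc=0.0, scale=1.0)], 0.5),
    ([stats.norm(loc=-3.0, scale=1.0), stats.norm(loc=5.0, scale=1.0)], -0.3),
]
weights = [0.5, 0.5]

x = np.array([[2.0, 0.0], [-3.0, 5.0]])
post = responsibilities(x, weights, components)
labels = post.argmax(axis=1)                  # hard cluster assignment
```

Each component pairs its own marginal families with its own dependence parameter, which is the flexibility a single-form mixture lacks. A full method would also need a rule for choosing among candidate marginal and copula families, which this sketch leaves out.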