🤖 AI Summary
This study addresses the mismatch between predefined group labels, such as clinical diagnoses, and the groups recovered by data-driven clustering. We propose an interpretable multi-group Gaussian mixture model (GMM) that allows smooth transitions between groups and is robust to cellwise outliers, that is, to atypical individual cells of an observation. Methodologically, we introduce the first smooth modeling framework integrating group-structure priors with data-driven mixtures, featuring a likelihood-based smooth GMM formulation, cellwise robust estimation, and an EM-type optimization algorithm. We further derive the theoretical breakdown point of the robust estimator, which quantifies its degree of robustness to cellwise outliers. On synthetic data, the model outperforms conventional group-mean estimators, independent GMMs, and non-robust covariance estimators. On real-world datasets from medicine and other domains, it identifies samples near ambiguous group boundaries as well as atypical data cells, improving the interpretability of the grouping and its clinical applicability.
📝 Abstract
Do data groups that are pre-defined by expert opinion or medical diagnosis correspond to groups based on statistical modeling? And why might observations be inconsistent with their assigned group? This contribution aims to answer both questions by proposing a novel multi-group Gaussian mixture model that accounts for the given group context while allowing high flexibility. This is achieved by assuming that the observations of a particular group originate not from a single distribution but from a Gaussian mixture of all group distributions. Moreover, the model is robust against cellwise outliers, that is, against atypical data cells of the observations. The objective function can be formulated as a likelihood problem and optimized efficiently. We also derive the theoretical breakdown point of the estimators, an innovative result in this context that quantifies the degree of robustness to cellwise outliers. Simulations demonstrate the excellent performance of the method and its advantages over alternative models and estimators. Applications from different areas illustrate the strength of the method, particularly in investigating observations that lie in the overlap of different groups.
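The central modeling assumption, that an observation labeled with one group is drawn from a mixture of all group distributions, can be illustrated with a minimal sketch. This is not the authors' implementation: it is restricted to the univariate case, it omits the cellwise robust estimation and the EM-type fitting entirely, and the names `group_density`, `responsibilities`, and the weight matrix `weights[g][h]` (the probability that an observation labeled g follows the distribution of group h) are assumptions introduced here for illustration, as are all parameter values.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution N(mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def group_density(x, g, weights, means, variances):
    """Density of an observation labeled g: a mixture over ALL group distributions."""
    return sum(
        weights[g][h] * normal_pdf(x, means[h], variances[h])
        for h in range(len(means))
    )


def responsibilities(x, g, weights, means, variances):
    """Posterior probability that x, labeled g, follows each group's distribution."""
    parts = [
        weights[g][h] * normal_pdf(x, means[h], variances[h])
        for h in range(len(means))
    ]
    total = sum(parts)
    return [p / total for p in parts]


# Hypothetical parameters for two groups (illustrative values only).
means = [0.0, 3.0]
variances = [1.0, 1.0]
weights = [[0.9, 0.1],   # observations labeled group 0
           [0.2, 0.8]]   # observations labeled group 1

# An observation labeled group 0 that sits at the center of group 1.
posterior = responsibilities(3.0, 0, weights, means, variances)
```

In this toy setting, `posterior[1]` exceeds `posterior[0]` even though the observation carries the label of group 0, which is the kind of label-versus-model inconsistency the abstract refers to. Observations with comparable posterior mass on several groups would be candidates for the overlap region between groups.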