Clustering data with values missing at random using scale mixtures of multivariate skew-normal distributions

📅 2025-07-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses clustering of skewed, heavy-tailed incomplete data under the missing-at-random (MAR) mechanism. It proposes a model-based clustering framework built on the finite mixture of scale mixtures of multivariate skew-normal (FMSMSN) distribution family. Unlike earlier work restricted to individual special cases, the method covers the entire FMSMSN family with missing data, modeling both skewness and excess kurtosis. Because the multivariate skew-normal distribution includes the normal distribution as a special case, the framework also subsumes conventional symmetric, normality-based clustering. The authors develop an augmented EM-type algorithm whose modified E-step uses closed-form conditional expectations of the missing values. Simulation experiments assess clustering performance and parameter recovery across varying missing rates, sample sizes, and degrees of cluster proximity. The method is further illustrated by applying special cases of the FMSMSN family to global CO₂ emissions data.

📝 Abstract

Handling missing data is a major challenge in model-based clustering, especially when the data exhibit skewness and heavy tails. We address this by extending the finite mixture of scale mixtures of multivariate skew-normal (FMSMSN) family to accommodate incomplete data under a missing-at-random (MAR) mechanism. Unlike previous work that is limited to one of the special cases of the FMSMSN family, our method offers a cluster analysis methodology for the entire family that accounts for skewness and excess kurtosis in data with missing values. The multivariate skew-normal distribution, as parameterised by \cite{azzalini1996} and \cite{arnoldbeaver}, includes the normal distribution as a special case, which ensures that our method remains compatible with existing symmetric model-based clustering techniques that assume normality. We derive the distributional properties of the missing components of the data and propose an augmented EM-type algorithm tailored for incomplete observations. The modified E-step yields closed-form expressions for the conditional expectations of the missing values. The simulation experiments showcase the flexibility of the FMSMSN family in both clustering performance and parameter recovery for varying percentages of missing values, while incorporating the effects of sample size and cluster proximity. Finally, we illustrate the practical utility of the proposed method by applying special cases of the FMSMSN family to global CO₂ emissions data.
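
To make the idea of a closed-form conditional expectation concrete, the sketch below covers only the normal special case mentioned in the abstract, not the paper's FMSMSN E-step. For a single Gaussian component with mean mu and covariance sigma, the expected value of the missing coordinates given the observed ones is mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o). The function name `conditional_mean_impute` and the example numbers are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def conditional_mean_impute(x, mu, sigma):
    """Fill NaN entries of x with E[x_mis | x_obs] under N(mu, sigma).

    Illustrative helper for the Gaussian special case only; the skew-normal
    and scale-mixture cases need additional terms not shown here.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    mis = np.isnan(x)   # boolean mask of missing coordinates
    obs = ~mis          # boolean mask of observed coordinates
    out = x.copy()

    if not mis.any():
        return out      # nothing to impute
    if not obs.any():
        out[mis] = mu[mis]  # no information: fall back to the marginal mean
        return out

    sigma_oo = sigma[np.ix_(obs, obs)]  # covariance among observed coordinates
    sigma_mo = sigma[np.ix_(mis, obs)]  # cross-covariance missing vs observed

    # Solve a linear system rather than forming an explicit inverse.
    adjustment = sigma_mo @ np.linalg.solve(sigma_oo, x[obs] - mu[obs])
    out[mis] = mu[mis] + adjustment
    return out

# Hypothetical two-dimensional example with the second coordinate missing.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
x = np.array([2.0, np.nan])
filled = conditional_mean_impute(x, mu, sigma)
```

In a mixture setting, an expectation of this kind would be computed per component and weighted by the component's posterior membership probability; that weighting is omitted here to keep the sketch short.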

Problem

Research questions and friction points this paper is trying to address.

Clustering incomplete skewed heavy-tailed data

Extending FMSMSN family for missing data

Developing EM algorithm for missing values

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends FMSMSN family for missing data clustering

Uses augmented EM-type algorithm for incomplete data

Incorporates skewness and kurtosis in cluster analysis
