A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random

📅 2025-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Modeling high-dimensional heterogeneous data whose missingness is non-ignorable (missing not at random, MNAR), while also selecting variables, remains a fundamental challenge. Method: We propose a unified framework that jointly addresses clustering, variable selection, and modeling of the MNAR mechanism. The approach uses EM-type penalized likelihood estimation with a missingness model that depends on the latent cluster and a data-driven, adaptive sparse penalty matrix. Contribution/Results: Under MNAR assumptions and certain regularity conditions, we establish two results together: asymptotic consistency of the estimator and consistency of variable selection. Experiments on simulated and real transcriptomic datasets show improved clustering accuracy and identification of relevant genes, and also evaluate computational efficiency.

📝 Abstract

Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying relevant variables that define heterogeneous subgroups and handling data that are missing not at random, a prevalent issue in fields like transcriptomics. While several notable methods have been proposed to address these problems, they typically tackle each issue in isolation, thereby limiting their flexibility and adaptability. This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into penalized clustering to enable more flexible variable selection, along with a mechanism that explicitly models the relationship between missingness and latent class membership. We demonstrate that, under certain regularity conditions, the proposed framework achieves both asymptotic consistency and selection consistency, even in the presence of missing data. This unified strategy significantly enhances the capability and efficiency of model-based clustering, advancing methodologies for identifying informative variables that define homogeneous subgroups in the presence of complex missing data patterns. The performance of the framework, including its computational efficiency, is evaluated through simulations and demonstrated using both synthetic and real-world transcriptomic datasets.
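To make the missingness setting concrete, here is a minimal simulation sketch in which the probability that an entry is missing depends on the latent class. Because the class label is never observed, the missingness depends on an unobserved quantity and is therefore not ignorable. Everything in it is an illustrative assumption rather than the paper's model or code: the function name `simulate_mnar_mixture`, the two-class Gaussian mixture, the mean shift of 3.0, and the per-class missingness probabilities are all hypothetical.

```python
import numpy as np

def simulate_mnar_mixture(n=2000, p=6, n_informative=2,
                          miss_prob=(0.05, 0.40), seed=0):
    """Two-class Gaussian mixture with class-dependent missingness.

    Hypothetical illustration only; not the paper's data-generating model.
    """
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)           # latent class labels (unobserved in practice)
    means = np.zeros((2, p))
    means[1, :n_informative] = 3.0           # only the first variables separate the classes
    x = rng.normal(loc=means[z], scale=1.0)  # complete data, shape (n, p)
    # Missingness probability is driven by the latent class, not by observed values.
    prob = np.asarray(miss_prob)[z][:, None]
    mask = rng.random((n, p)) < prob         # True where the entry is missing
    x_obs = np.where(mask, np.nan, x)
    return x_obs, z, mask

x_obs, z, mask = simulate_mnar_mixture()
rate_by_class = [mask[z == k].mean() for k in (0, 1)]
print("missing rate per latent class:", rate_by_class)
```

In data like these, the missingness pattern itself carries information about class membership, which is why a method that ignores the mechanism can lose or distort the clustering signal.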

Problem

Research questions and friction points this paper is trying to address.

Identifying relevant variables for heterogeneous subgroups in clustering

Handling missing not at random data in model-based clustering

Unifying variable selection and missing data modeling in clustering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for variable selection in clustering

Data-driven penalty matrix for flexible selection

Explicit model of the relationship between missingness and latent class membership
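As a rough sketch of how these three ingredients can fit into one objective, a generic penalized mixture log-likelihood with a class-dependent missingness term might look like the following. The notation is assumed here for illustration and is not taken from the paper: $x_i^{\mathrm{obs}}$ is the observed part of sample $i$, $r_i$ its missingness indicator vector, $\pi_k$ the mixing proportions, $\mu_{kj}$ the mean of variable $j$ in cluster $k$, $\psi_k$ the missingness parameters of cluster $k$, and $w_{kj}$ the entries of a data-driven penalty matrix.

```latex
\ell_{\mathrm{pen}}(\theta)
  = \sum_{i=1}^{n} \log \sum_{k=1}^{K}
      \pi_k \, f\!\left(x_i^{\mathrm{obs}} \mid z_i = k;\ \theta_k\right)
            \, g\!\left(r_i \mid z_i = k;\ \psi_k\right)
    \;-\; \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} w_{kj} \,\lvert \mu_{kj} \rvert
```

Under this assumed form, variables whose penalized parameters shrink to zero in every cluster would be treated as uninformative for the clustering, and the factor $g$ is where the dependence of missingness on the latent class enters. The paper's actual penalty and missingness model may differ.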