Post-clustering Inference under Dependency

📅 2023-10-18

📈 Citations: 1

✨ Influential: 0


🤖 AI Summary

Problem: The post-clustering inference framework of Gao et al. (JASA 2022) is limited to i.i.d. Gaussian data with a spherical covariance matrix and does not accommodate dependence among observations or features. Method: We generalize their framework to arbitrary dependence settings by developing a dependence-aware post-clustering inference methodology compatible with hierarchical agglomerative clustering (single/complete/average linkage) and k-means. We derive theoretical conditions under which the p-value is well defined and controls the selective type I error, and under which the covariance matrix can be estimated compatibly with that control. Contribution/Results: Combining selective inference with covariance structure modeling, we design a pipeline for testing mean differences between estimated clusters under dependence, and illustrate it on synthetic data and real protein structure data.

📝 Abstract

Recent work by Gao et al. has laid the foundations for post-clustering inference. For the first time, the authors established a theoretical framework for testing differences between the means of estimated clusters. Additionally, they studied the estimation of unknown parameters while controlling the selective type I error. However, their theory was developed for independent observations identically distributed as $p$-dimensional Gaussian variables with a spherical covariance matrix. Here, we aim to extend this framework to a scenario better suited to practical applications, where arbitrary dependence structures between observations and between features are allowed. We show that a $p$-value for post-clustering inference under general dependency can be defined, and we assess the theoretical conditions under which a covariance matrix can be estimated compatibly with this framework. The theory is developed for hierarchical agglomerative clustering algorithms with several types of linkage, and for the $k$-means algorithm. We illustrate our method with synthetic data and real data of protein structures.
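To make the setting concrete, here is a minimal sketch of the quantity being tested, not the paper's method. It assumes a matrix normal model with a known covariance `U` between observations and a known covariance `Sigma` between features; the function name `naive_mean_difference_test` and all parameter values are hypothetical. The sketch computes the difference between two estimated cluster means and a naive chi-square p-value that ignores the fact that the clusters were estimated from the same data, which is exactly the step the selective framework is designed to correct.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import chi2


def naive_mean_difference_test(X, labels, k1, k2, U, Sigma):
    """Naive (non-selective) test of equal means between clusters k1 and k2.

    Assumption: X (n x p) is matrix normal with covariance U (n x n) between
    observations and Sigma (p x p) between features. The p-value ignores that
    `labels` were estimated from X, so it is NOT a valid selective p-value.
    """
    in1 = labels == k1
    in2 = labels == k2
    # Contrast vector: X.T @ nu is the difference of the two cluster means.
    nu = in1 / in1.sum() - in2 / in2.sum()
    diff = X.T @ nu
    # Under the assumed model, Cov(X.T @ nu) = (nu' U nu) * Sigma.
    V = (nu @ U @ nu) * Sigma
    stat = float(diff @ np.linalg.solve(V, diff))
    p_value = float(chi2.sf(stat, df=X.shape[1]))
    return stat, p_value


# Simulate data with no true clusters but dependent observations and features.
rng = np.random.default_rng(0)
n, p = 40, 2
idx = np.arange(n)
U = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # AR(1)-type dependence
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
A = np.linalg.cholesky(U)
B = np.linalg.cholesky(Sigma)
X = A @ rng.standard_normal((n, p)) @ B.T  # mean zero: global null holds

# Estimate two clusters with average-linkage hierarchical clustering.
labels = fcluster(linkage(X, method="average"), t=2, criterion="maxclust")
stat, p_value = naive_mean_difference_test(X, labels, 1, 2, U, Sigma)
print(f"statistic = {stat:.3f}, naive p-value = {p_value:.3g}")
```

Because the clustering algorithm separates the data along directions where the observed means already differ, this naive p-value tends to be too small even when no true clusters exist. The selective approach instead conditions on the clustering event when defining the p-value.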

Problem

Research questions and friction points this paper is trying to address.

Extending post-clustering inference to dependent data structures

Establishing conditions under which covariance matrix estimation remains compatible with selective type I error control

Generalizing framework for hierarchical and k-means clustering algorithms

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends post-clustering inference to dependent observations

Allows arbitrary dependence structures between features

Supports hierarchical and k-means clustering algorithms
