🤖 AI Summary
The post-clustering inference framework of Gao et al. (JASA 2022) is limited to i.i.d. Gaussian data with a spherical covariance matrix, and does not accommodate dependence among observations or among features.
Method: We generalize their framework to arbitrary dependence settings by developing a unified, dependence-aware post-clustering inference methodology compatible with hierarchical agglomerative clustering (single/complete/average linkage) and k-means. We derive theoretical conditions for well-defined p-values that ensure selective Type I error control and enable consistent covariance matrix estimation.
Contribution/Results: Combining selective inference, high-dimensional statistics, and covariance structure modeling, we design a pipeline for testing mean differences between estimated clusters under dependence. Experiments on synthetic data and on real protein structure data illustrate the method.
📝 Abstract
Recent work by Gao et al. has laid the foundations for post-clustering inference. For the first time, the authors established a theoretical framework for testing differences between the means of estimated clusters. Additionally, they studied the estimation of unknown parameters while controlling the selective type I error. However, their theory was developed for independent observations identically distributed as $p$-dimensional Gaussian variables with a spherical covariance matrix. Here, we aim to extend this framework to a scenario better suited to practical applications, where arbitrary dependence structures between observations and between features are allowed. We show that a $p$-value for post-clustering inference under general dependency can be defined, and we assess the theoretical conditions that allow compatible estimation of a covariance matrix. The theory is developed for hierarchical agglomerative clustering algorithms with several types of linkage, and for the $k$-means algorithm. We illustrate our method with synthetic data and real data of protein structures.
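For concreteness, the testing problem described above can be sketched as follows. The notation is an assumption introduced here for illustration, following the usual formulation of selective inference for clustering rather than this paper's own symbols: $\mu_i$ is the mean of observation $i$, $\hat{C}_1$ and $\hat{C}_2$ are two clusters estimated from the data $X$, and $\mathcal{C}(X)$ is the partition returned by the clustering algorithm.

```latex
% Null hypothesis: the two estimated clusters have equal average means.
% (Illustrative notation; the symbols are assumptions, not taken from the abstract.)
\[
  H_0 :\; \bar{\mu}_{\hat{C}_1} = \bar{\mu}_{\hat{C}_2},
  \qquad
  \bar{\mu}_{G} = \frac{1}{|G|} \sum_{i \in G} \mu_i .
\]
% Selective type I error control: the test is valid conditionally on the
% event that the clustering algorithm produced these two clusters.
\[
  \mathbb{P}_{H_0}\!\left(
    \text{reject } H_0 \text{ at level } \alpha
    \;\middle|\;
    \hat{C}_1, \hat{C}_2 \in \mathcal{C}(X)
  \right) \le \alpha
  \qquad \text{for all } \alpha \in [0, 1].
\]
```

The conditioning event is what distinguishes this from a classical two-sample test: because the clusters are chosen from the same data used for testing, an unconditional test would tend to reject too often.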