🤖 AI Summary
Problem: Clustering high-dimensional sparse data where the true signal lies in an unknown low-dimensional subspace remains challenging, especially when precise estimation of sparsity patterns or precision matrices is infeasible.
Method: We propose an iterative framework that alternates between sparse feature selection, motivated by a minimax separation bound, and a semidefinite programming (SDP) relaxation of K-means clustering, without requiring explicit estimation of sparsity parameters or the precision matrix. In each iteration, features are selected by thresholding a rough estimate of the discriminative direction, and the SDP relaxation then produces the cluster assignment on the selected features. For unknown sparse precision matrices, the algorithm computes only the minimally required quantities, avoiding full estimation of the model parameters.
Contribution/Results: We establish statistical consistency under high-dimensional sparse settings. Across a range of simulation settings, the method outperforms several state-of-the-art baselines in label recovery, especially in higher dimensions.
📝 Abstract
We propose an iterative algorithm for clustering high-dimensional data, where the true signal lies in a much lower-dimensional space. Our method alternates between feature selection and clustering, without requiring precise estimation of sparse model parameters. Feature selection is performed by thresholding a rough estimate of the discriminative direction, while clustering is carried out via a semidefinite programming (SDP) relaxation of K-means. In the isotropic case, the algorithm is motivated by the minimax separation bound for exact recovery of cluster labels using varying sparse subsets of features. This bound highlights the critical role of variable selection in achieving exact recovery. We further extend the algorithm to settings with unknown sparse precision matrices, avoiding full model parameter estimation by computing only the minimally required quantities. Across a range of simulation settings, we find that the proposed iterative approach outperforms several state-of-the-art methods, especially in higher dimensions.
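The alternation between feature selection and clustering can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: two clusters, isotropic unit-variance noise, the difference of cluster means as the rough estimate of the discriminative direction, and a threshold constant `c` chosen only for illustration. The clustering step uses a spectral initialization followed by Lloyd updates as a lightweight stand-in for the SDP relaxation of K-means. The function names `two_means` and `iterative_sparse_clustering` are hypothetical.

```python
import numpy as np

def two_means(X, n_iter=50):
    """Stand-in for the SDP step: spectral init plus Lloyd updates (K = 2)."""
    Xc = X - X.mean(axis=0)
    # The sign of the leading left singular vector gives the initial split.
    u = np.linalg.svd(Xc, full_matrices=False)[0][:, 0]
    labels = (u > 0).astype(int)
    for _ in range(n_iter):
        if labels.min() == labels.max():  # degenerate split, stop early
            break
        centers = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def iterative_sparse_clustering(X, n_rounds=5, c=2.0):
    """Alternate between thresholding-based feature selection and clustering."""
    n, p = X.shape
    selected = np.arange(p)
    labels = two_means(X)  # initial clustering on all features
    for _ in range(n_rounds):
        n0, n1 = np.sum(labels == 0), np.sum(labels == 1)
        if min(n0, n1) == 0:
            break
        # Rough discriminative direction: difference of cluster means
        # (assumes isotropic, unit-variance noise).
        diff = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
        # Illustrative threshold; the constant c is an assumption.
        tau = c * np.sqrt(np.log(p) * (1.0 / n0 + 1.0 / n1))
        new_selected = np.flatnonzero(np.abs(diff) > tau)
        if new_selected.size == 0:
            break
        new_labels = two_means(X[:, new_selected])
        # Labels are only identified up to swapping the two clusters.
        same_labels = (np.array_equal(new_labels, labels)
                       or np.array_equal(new_labels, 1 - labels))
        converged = np.array_equal(new_selected, selected) and same_labels
        selected, labels = new_selected, new_labels
        if converged:
            break
    return labels, selected

# Synthetic example: only the first s of p features carry signal.
rng = np.random.default_rng(0)
n, p, s = 100, 200, 5
truth = np.repeat([0, 1], n // 2)
mu = np.zeros(p)
mu[:s] = 2.0
X = rng.standard_normal((n, p)) + np.where(truth[:, None] == 1, mu, -mu)

labels, selected = iterative_sparse_clustering(X)
# Accuracy up to label swapping.
accuracy = max(np.mean(labels == truth), np.mean(labels != truth))
print("selected features:", selected.tolist())
print("accuracy:", accuracy)
```

In this sketch the threshold shrinks the feature set after the first round, so the second clustering pass runs on far fewer coordinates than the first. Replacing `two_means` with an SDP solver would bring the clustering step closer to the method described above.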