🤖 AI Summary
This paper addresses two fundamental challenges in latent class modeling for high-dimensional binary data: recovering individual class memberships and determining the number of latent classes. We propose a two-stage algorithm that initializes class assignments by spectral clustering and then refines them with a single likelihood-based update. Under mild regularity conditions, the method achieves optimal latent class recovery and exact clustering of subjects. We also construct a simple, consistent, and tuning-free estimator for the number of latent classes. Simulations and real-data analyses support the theoretical results and show that the approach outperforms alternative methods, combining the computational efficiency of spectral clustering with the statistical accuracy of likelihood-based estimation.
📝 Abstract
Latent class models are widely used for identifying unobserved subgroups from multivariate categorical data in the social sciences, with binary data as a particularly popular example. However, accurately recovering individual latent class memberships and determining the number of classes remain challenging, especially when handling large-scale datasets with many items. This paper proposes a novel two-stage algorithm for latent class models with high-dimensional binary responses. Our method first initializes latent class assignments by an easy-to-implement spectral clustering algorithm, and then refines these assignments with a one-step likelihood-based update. This approach combines the computational efficiency of spectral clustering with the improved statistical accuracy of likelihood-based estimation. We establish theoretical guarantees showing that this method leads to optimal latent class recovery and exact clustering of subjects under mild conditions. Additionally, we propose a simple consistent estimator for the number of latent classes. Extensive experiments on both simulated data and real data validate our theoretical results and demonstrate our method's superior performance over alternative methods.
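
The two-stage idea described above can be illustrated with a minimal sketch. The abstract does not spell out the exact spectral step or the form of the likelihood update, so everything below is an assumption made for illustration: the spectral step is taken to be k-means on the top-K left singular vectors of the binary response matrix, the refinement is taken to be a single reassignment of each subject to the class with the highest Bernoulli log-likelihood, and all function names (`kmeans`, `spectral_init`, `one_step_refine`, `simulate`, `clustering_accuracy`) and simulation settings are hypothetical rather than the authors' implementation.

```python
from itertools import permutations

import numpy as np


def kmeans(X, K, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # Next center: the point farthest from all centers chosen so far.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels


def spectral_init(R, K, seed=0):
    """Stage 1 (assumed form): k-means on the top-K left singular
    vectors of R, scaled by their singular values."""
    U, s, _ = np.linalg.svd(R.astype(float), full_matrices=False)
    return kmeans(U[:, :K] * s[:K], K, seed=seed)


def one_step_refine(R, labels, K, eps=1e-3):
    """Stage 2 (assumed form): estimate item parameters from the initial
    labels, then reassign each subject once by Bernoulli log-likelihood."""
    theta = np.vstack([
        R[labels == k].mean(axis=0) if np.any(labels == k) else R.mean(axis=0)
        for k in range(K)
    ])  # K x J matrix of estimated response probabilities
    theta = np.clip(theta, eps, 1 - eps)  # keep the logarithms finite
    loglik = R @ np.log(theta).T + (1 - R) @ np.log(1 - theta).T  # N x K
    return loglik.argmax(axis=1)


def simulate(n=300, J=100, K=3, seed=0):
    """Toy data: item probabilities drawn from {0.2, 0.8} (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = rng.choice([0.2, 0.8], size=(K, J))
    z = rng.integers(K, size=n)
    R = rng.binomial(1, theta[z])
    return R, z


def clustering_accuracy(z_true, z_hat, K):
    """Best agreement between label vectors over all relabelings of z_hat."""
    return max(np.mean(np.array(p)[z_hat] == z_true)
               for p in permutations(range(K)))


R, z = simulate()
init = spectral_init(R, K=3)
refined = one_step_refine(R, init, K=3)
print("accuracy after spectral init:", clustering_accuracy(z, init, 3))
print("accuracy after one-step refinement:", clustering_accuracy(z, refined, 3))
```

In this sketch the number of classes K is supplied by the user; the paper's estimator for K is not described in enough detail here to reproduce. The refinement costs a single pass over the data, which is the sense in which the approach pairs a cheap spectral start with a likelihood-based correction.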