🤖 AI Summary
K-modes clustering, widely adopted in Physics Education Research (PER), lacks a probabilistic foundation, hindering quantification of classification uncertainty, integration of external covariates, and principled assignment of individuals to subgroups. Method: We introduce Latent Class Analysis (LCA) as a model-based clustering paradigm for PER, implementing it in R with BIC/AIC for model selection and validating performance via parallel simulation and empirical studies. Contribution/Results: LCA robustly identifies cognitively distinct student subpopulations, explicitly quantifies classification uncertainty, and naturally enables joint modeling of latent classes with background variables. Compared to k-modes, LCA enhances interpretability of subgroups and strengthens causal inference for educational interventions. This work establishes a generalizable, probabilistic clustering framework for PER, advancing methodological rigor and theoretical grounding in educational data analysis.
📝 Abstract
Clustering methods are often used in physics education research (PER) to identify subgroups of individuals within a population who share similar response patterns or characteristics. K-means (or k-modes, for categorical data) is one of the most commonly used clustering methods in PER. This algorithm, however, is not model-based: it relies on algorithmic partitioning and assigns individuals to subgroups with definite membership. Researchers must also conduct post-hoc analyses to relate subgroup membership to other variables. Mixture models offer a model-based alternative that accounts for classification errors and allows researchers to directly integrate subgroup membership into a broader latent variable framework. In this paper, we outline the theoretical similarities and differences between k-modes clustering and latent class analysis (one type of mixture model for categorical data). We also present parallel analyses using each method to address the same research questions in order to demonstrate these similarities and differences. We provide the data and R code to replicate the worked example presented in the paper for researchers interested in using mixture models.
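The contrast between definite membership and model-based membership can be illustrated with a minimal sketch. This is not the paper's R code: the class proportions, item-response probabilities, cluster modes, and function names below are all hypothetical values chosen for illustration, and the latent class model is assumed to be already fitted (no estimation step is shown). The sketch computes posterior class probabilities for one response pattern via Bayes' rule under local independence, and compares them with a k-modes style hard assignment based on Hamming distance.

```python
from math import prod

# Hypothetical two-class model for three binary (0/1) items.
# All numbers are illustrative assumptions, not estimates from any data set.
class_priors = [0.6, 0.4]      # P(class = c)
item_probs = [                 # P(item j = 1 | class = c)
    [0.9, 0.8, 0.7],           # class 0
    [0.2, 0.3, 0.4],           # class 1
]

def lca_posterior(response, priors, probs):
    """Posterior P(class | response) by Bayes' rule, assuming local independence."""
    joint = []
    for prior, p_c in zip(priors, probs):
        # Likelihood of the response pattern within this class.
        likelihood = prod(p if x == 1 else 1.0 - p for x, p in zip(response, p_c))
        joint.append(prior * likelihood)
    total = sum(joint)
    return [j / total for j in joint]

def kmodes_assign(response, modes):
    """Hard assignment: index of the mode with the smallest Hamming distance."""
    distances = [sum(x != m for x, m in zip(response, mode)) for mode in modes]
    return distances.index(min(distances))

modes = [[1, 1, 1], [0, 0, 0]]  # hypothetical cluster modes
response = [1, 0, 1]            # one student's response pattern

posterior = lca_posterior(response, class_priors, item_probs)
hard_label = kmodes_assign(response, modes)

print("LCA posterior:", [round(p, 3) for p in posterior])
print("k-modes label:", hard_label)
```

With these assumed parameters, the student is placed in cluster 0 by the hard assignment, while the latent class posterior is roughly 0.77 for class 0 and 0.23 for class 1. That remaining 0.23 is the classification uncertainty that a definite-membership method does not report.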