🤖 AI Summary
To address the trade-off between inadequate uncertainty quantification and low computational efficiency in modelling mixed-type (continuous + categorical) data, this paper proposes a Bayesian mixture modelling framework based on coordinate-ascent variational inference (CAVI), presented as the first systematic application of variational inference to Bayesian modelling of mixed data. The method employs latent variables to capture data heterogeneity and complex inter-variable dependencies, ensures asymptotic consistency of the posterior means, and substantially reduces computational overhead compared to MCMC; theoretical analysis provides convergence guarantees. Experiments on simulated datasets and the real-world NHANES dataset demonstrate that the proposed approach achieves both high accuracy, delivering comprehensive uncertainty quantification, and high efficiency, reducing computation time by one to two orders of magnitude relative to state-of-the-art methods, making it suitable for large-scale mixed-data applications.
📝 Abstract
Heterogeneous, mixed-type datasets containing both continuous and categorical variables are ubiquitous, and they enrich data analysis by allowing more complex relationships and interactions to be modelled. Mixture models offer a flexible framework for capturing the underlying heterogeneity and relationships in mixed-type datasets. Most current approaches to modelling mixed data either forgo uncertainty quantification and conduct only point estimation, or rely on MCMC, which incurs a high computational cost that does not scale to large datasets. This paper develops a coordinate ascent variational inference (CAVI) algorithm for mixture models on mixed (continuous and categorical) data, which circumvents the high computational cost of MCMC while retaining uncertainty quantification. We demonstrate our approach through simulation studies as well as an applied case study of the NHANES risk factor dataset. In addition, we show that the posterior means from CAVI for this model converge to the true parameter values as the sample size n tends to infinity, providing theoretical justification for our method.
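To make the idea concrete, here is a minimal mean-field CAVI sketch for a toy mixture on mixed data, in the spirit of the abstract but not the paper's actual model: each component has one Gaussian feature (unit variance, a normal prior on the mean) and one categorical feature (a Dirichlet prior on category probabilities), with mixing weights fixed uniform. All priors, hyperparameters, and function names are illustrative assumptions.

```python
import numpy as np
from scipy.special import digamma


def cavi_mixed_mixture(x, c, K=2, n_cat=2, s0_sq=100.0, alpha0=1.0,
                       n_iter=100, seed=0):
    """Toy CAVI for a K-component mixture on one continuous feature x
    (unit variance, N(0, s0_sq) prior on each component mean mu_k) and
    one categorical feature c (Dirichlet(alpha0) prior on each
    component's category probabilities). Uniform mixing weights.
    Illustrative sketch only, not the paper's model."""
    n = len(x)
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.ones(K), size=n)  # q(z_i): responsibilities
    for _ in range(n_iter):
        # Update q(mu_k) = N(m_k, s_k^2) for the continuous feature.
        nk = phi.sum(axis=0)
        s_sq = 1.0 / (1.0 / s0_sq + nk)
        m = s_sq * (phi * x[:, None]).sum(axis=0)
        # Update q(theta_k) = Dirichlet(alpha_k) for the categorical feature.
        alpha = alpha0 + np.array(
            [[phi[c == j, k].sum() for j in range(n_cat)] for k in range(K)])
        # Update responsibilities from expected log-likelihoods under q.
        e_log_theta = digamma(alpha) - digamma(alpha.sum(axis=1, keepdims=True))
        log_rho = (-0.5 * (x[:, None] - m) ** 2 - 0.5 * s_sq
                   + e_log_theta[:, c].T)
        log_rho -= log_rho.max(axis=1, keepdims=True)  # stabilise softmax
        phi = np.exp(log_rho)
        phi /= phi.sum(axis=1, keepdims=True)
    return m, s_sq, alpha, phi


# Synthetic mixed data: two well-separated clusters (illustrative).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)])
c = np.concatenate([rng.choice(2, 100, p=[0.9, 0.1]),
                    rng.choice(2, 100, p=[0.1, 0.9])])
m, s_sq, alpha, phi = cavi_mixed_mixture(x, c)
```

Each iteration cycles through closed-form coordinate updates of the variational factors, which is what makes CAVI cheap relative to MCMC; the variational posterior also yields uncertainty estimates (here, `s_sq` and `alpha`) rather than point estimates alone.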