Bayesian low-rank latent-cluster regression for mixed health outcomes

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses high-dimensional health monitoring data, which often combine multicollinear predictors, mixed-type responses (Gaussian, Bernoulli, and negative binomial), and latent heterogeneity, making simultaneous clustering, dimension reduction, and interpretable modeling difficult. The authors propose a Bayesian latent-cluster reduced-rank regression model that represents the response as a finite mixture of regression surfaces, each with its own mean shift and low-rank coefficient matrix. The approach combines latent clusters, an adaptive low-rank structure, and a mixed exponential-family likelihood, and uses multiplicative gamma process shrinkage to adapt the effective rank within each cluster. The number of clusters and the nominal maximal rank are tuned with a WAIC-based criterion. The theory establishes posterior contraction for the component-specific regression surfaces and mean shifts, up to label permutation, and for the predictor-side singular subspaces. Simulations across all-Gaussian, all-Bernoulli, all-negative-binomial, and mixed regimes show accurate recovery of the number of clusters and clustering performance competitive with K-means, mclust, PCA-based clustering, and a Gaussian reduced-rank mixture benchmark. Three real-world health applications show interpretable individual-level utilization groups and county- and state-level cluster maps.
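
To make the rank-shrinkage idea concrete, the following is a minimal sketch of a multiplicative gamma process prior in its standard form, where the precision of the h-th factor column is a cumulative product of gamma variables. The function names and the hyperparameters `a1` and `a2` are illustrative assumptions; the paper's exact prior specification and hyperparameter choices are not given in this summary.

```python
import math
import random


def sample_mgp_precisions(max_rank, a1=2.0, a2=3.0, rng=None):
    """Draw column precisions tau_1..tau_H from a multiplicative gamma process.

    tau_h = delta_1 * ... * delta_h, with delta_1 ~ Gamma(a1, 1) and
    delta_l ~ Gamma(a2, 1) for l >= 2. The values of a1 and a2 are
    assumptions chosen for illustration, not the paper's settings.
    """
    rng = rng or random.Random()
    # random.gammavariate takes (shape, scale)
    deltas = [rng.gammavariate(a1, 1.0)]
    deltas += [rng.gammavariate(a2, 1.0) for _ in range(max_rank - 1)]
    taus = []
    running = 1.0
    for d in deltas:
        running *= d          # cumulative product gives tau_h
        taus.append(running)
    return deltas, taus


def mean_log_precision(max_rank, n_draws=2000, seed=0):
    """Monte Carlo average of log(tau_h) for each column index h."""
    rng = random.Random(seed)
    totals = [0.0] * max_rank
    for _ in range(n_draws):
        _, taus = sample_mgp_precisions(max_rank, rng=rng)
        for h, t in enumerate(taus):
            totals[h] += math.log(t)
    return [s / n_draws for s in totals]


if __name__ == "__main__":
    # Larger log-precision means a column is shrunk harder toward zero.
    for h, v in enumerate(mean_log_precision(5), start=1):
        print(f"column {h}: mean log precision = {v:.2f}")
```

With `a2 > 1` the precisions tend to grow with the column index, so higher-order columns of a cluster's coefficient factorization are pulled toward zero and the effective rank can fall below the nominal maximum.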

📝 Abstract

High-dimensional health and surveillance studies often involve many collinear predictors, multiple correlated outcomes of different types, and latent heterogeneity across observational units. We propose a Bayesian latent-cluster reduced-rank regression model for multivariate mixed outcomes. The model is a finite mixture of regression surfaces: each latent cluster has a cluster-specific mean shift and a low-rank coefficient matrix, yielding simultaneous clustering, dimension reduction, and component-wise interpretability. Response coordinates may be Gaussian, Bernoulli, or negative binomial. Multiplicative gamma process shrinkage adapts the effective rank within each cluster, and a WAIC-based criterion is used to tune the number of clusters and the nominal maximal rank. We establish posterior contraction for the identifiable component-specific regression surfaces and mean shifts, up to label permutation, and derive corresponding contraction for predictor-side singular subspaces. We also analyze the default label-invariant reporting pipeline based on the posterior similarity matrix: an eigenspace embedding followed by mean shift is shown to consistently recover the latent partition under an additional strong separation margin. Simulation experiments spanning all-Gaussian, all-Bernoulli, all-negative-binomial, and mixed Gaussian--Bernoulli--negative-binomial regimes show accurate recovery of the number of clusters and competitive clustering performance against $K$-means, mclust, PCA-based clustering, and a Gaussian reduced-rank mixture benchmark. We illustrate the method in three applications that show how the model separates individual-level utilization groups and produces interpretable county- and state-level cluster maps together with response-specific posterior predictive maps.
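
As a rough illustration of why the posterior similarity matrix gives a label-invariant summary, the sketch below computes it from a list of sampled cluster-label vectors. The function name and the toy draws are hypothetical, and the eigenspace embedding and mean-shift steps described in the abstract are not shown.

```python
def posterior_similarity(label_draws):
    """Estimate the posterior similarity matrix from MCMC label draws.

    label_draws is a list of draws; each draw is a list holding one
    cluster label per observational unit. Entry (i, j) of the result
    is the fraction of draws in which units i and j share a label.
    """
    n_units = len(label_draws[0])
    counts = [[0.0] * n_units for _ in range(n_units)]
    for labels in label_draws:
        for i in range(n_units):
            for j in range(n_units):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    n_draws = len(label_draws)
    return [[c / n_draws for c in row] for row in counts]


if __name__ == "__main__":
    # Toy draws for four units; the second draw swaps the label names
    # of the first, which leaves co-clustering unchanged.
    draws = [
        [0, 0, 1, 1],
        [1, 1, 0, 0],
        [0, 0, 1, 0],
    ]
    for row in posterior_similarity(draws):
        print([round(v, 2) for v in row])
```

Because each entry depends only on whether two units share a label within a draw, permuting the label names inside any draw leaves the matrix unchanged, which is what makes it a convenient starting point for reporting a single partition.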

Problem

Research questions and friction points this paper is trying to address.

mixed outcomes

latent heterogeneity

high-dimensional data

collinear predictors

multivariate regression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian latent-cluster

reduced-rank regression

mixed outcomes