🤖 AI Summary
Existing methods struggle to effectively handle extensive missing values in high-dimensional longitudinal data and fail to disentangle between-subject heterogeneity from within-subject temporal dynamics. This work proposes Hierarchical Probabilistic Principal Component Analysis (HPPCA), a novel approach that explicitly separates these two sources of variation through a two-level probabilistic factor model: between-subject variability is captured by global latent factors, while within-subject temporal dynamics are modeled via Gaussian processes. By integrating an EM algorithm with flexible covariance kernels, HPPCA efficiently accommodates missing observations. Notably, it introduces a hierarchical structure into the probabilistic PCA framework for the first time to model nested longitudinal variation. Experimental results demonstrate that HPPCA significantly outperforms standard PPCA and multivariate functional PCA under high missingness rates, achieving more accurate latent subspace recovery and improved performance in clinical outcome prediction and masked record reconstruction.
📝 Abstract
In many longitudinal studies, a large number of variables are measured repeatedly over time, with substantial missing data. Existing methods, such as probabilistic principal component analysis (PPCA), are ill-equipped to handle such incomplete, high-dimensional longitudinal data, as they fail to account for the nested sources of variation and temporal dependency inherent in repeated measures. We introduce hierarchical probabilistic principal component analysis (HPPCA), a two-level probabilistic factor model that explicitly separates between-subject variance from time-varying within-subject dynamics. The within-subject latent factors are modeled by a Gaussian process. We develop an EM algorithm that handles missing data and accommodates flexible covariance kernels, accelerated by computationally efficient initializers. Simulation studies show that HPPCA robustly recovers the model parameter subspaces and substantially outperforms both standard PPCA and multivariate functional PCA in imputation accuracy, even under heavy missingness and model misspecification. An application to long COVID symptoms in the Researching COVID to Enhance Recovery (RECOVER) adult cohort shows that HPPCA effectively captures the data's hierarchical structure, and that its learned features significantly improve the prediction of clinical outcomes and the recovery of masked clinical records compared to existing methods.
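To make the two-level structure concrete, the following is a minimal simulation sketch, not the authors' implementation. It assumes one plausible generative form: each observation is a mean, plus a loading matrix applied to a subject-level factor that is constant over time, plus a second loading matrix applied to a time-varying factor drawn from a Gaussian process, plus isotropic noise. The names `simulate_two_level`, `rbf_kernel`, `W_b`, `W_w`, the squared-exponential kernel, and all default values are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def rbf_kernel(times, length_scale=1.0):
    """Squared-exponential covariance between all pairs of time points.

    The kernel choice is an assumption; the abstract only says the
    covariance kernels are flexible.
    """
    diff = times[:, None] - times[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def simulate_two_level(n_subjects=50, n_times=8, n_vars=12,
                       q_between=2, q_within=2, length_scale=0.3,
                       noise_sd=0.1, missing_rate=0.3, seed=0):
    """Draw incomplete longitudinal data from a hypothetical two-level model."""
    rng = np.random.default_rng(seed)
    times = np.linspace(0.0, 1.0, n_times)

    # Hypothetical parameters: mean vector and the two loading matrices.
    mu = rng.normal(size=n_vars)
    W_b = rng.normal(size=(n_vars, q_between))  # between-subject loadings
    W_w = rng.normal(size=(n_vars, q_within))   # within-subject loadings

    # Between-subject factors: one draw per subject, constant over time.
    z = rng.normal(size=(n_subjects, q_between))

    # Within-subject factors: independent GP draws over the time grid,
    # obtained by colouring white noise with the Cholesky factor of K.
    K = rbf_kernel(times, length_scale) + 1e-6 * np.eye(n_times)
    chol = np.linalg.cholesky(K)
    white = rng.normal(size=(n_subjects, n_times, q_within))
    u = np.einsum("ts,isk->itk", chol, white)

    # Combine both levels; result has shape (subjects, times, variables).
    between = (z @ W_b.T)[:, None, :]
    within = u @ W_w.T
    noise = noise_sd * rng.normal(size=(n_subjects, n_times, n_vars))
    Y = mu + between + within + noise

    # Remove entries completely at random to mimic missing data.
    mask = rng.random(Y.shape) < missing_rate
    Y_obs = np.where(mask, np.nan, Y)
    return Y_obs, mask, times


if __name__ == "__main__":
    Y_obs, mask, times = simulate_two_level()
    print("data shape:", Y_obs.shape)
    print("fraction missing:", mask.mean())
```

Data simulated this way could serve as a test bed for comparing imputation methods, since the true values behind the masked entries are known. Entries are removed completely at random here, which is a simplifying assumption rather than a claim about the missingness pattern studied in the paper.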