AI Summary
This study addresses the identification of multimorbidity clusters significantly associated with mortality from large-scale electronic health records. To this end, we propose a Bayesian profile regression model conditioned on covariates and mortality outcomes, integrated with a Dirichlet process mixture model to automatically infer the number of clusters. Methodologically, we introduce full-rank stochastic variational inference (SVI) into this framework for the first time, achieving computational efficiency substantially greater than that of traditional NUTS samplers while maintaining comparable accuracy. Applied to real-world data from 1,296,463 individuals, the model identified 33 distinct disease clusters, with clusters such as metastatic cancer and heart failure showing strong associations with elevated mortality risk, thereby demonstrating the method's validity and scalability.
Abstract
Multiple long-term conditions (MLTC) are increasingly observed in clinical practice globally. Clustering methods that group diseases into commonly co-occurring clusters are of interest for understanding how MLTC group together and how they affect patient outcomes. However, such approaches require large, often population-scale datasets. Bayesian Profile Regression (BPR) is a statistical model that combines a Dirichlet Process Mixture model with a hierarchical regression model in order to form clusters of items conditional on covariates and an outcome of interest. We developed a BPR model using full-rank Stochastic Variational Inference (SVI) for application to large-scale data. We assessed its performance in simulation studies comparing fits obtained with the No-U-Turn Sampler (NUTS) and with full-rank SVI. We then fitted a BPR model to find clusters of MLTC in a population-scale dataset held in the Secure Anonymised Information Linkage (SAIL) databank. Results from full-rank SVI compared well with results from NUTS in the simulation study, and the faster fitting made it possible to fit models to population-scale datasets. There were 1,296,463 individuals in our electronic health record (EHR) cohort. The clustering model was conditioned on age at cohort entry, socioeconomic deprivation and sex, with mortality as the outcome. Using the Elixhauser comorbidity index disease definitions, we found 33 disease clusters. Clusters featuring metastatic cancer and cardiovascular diseases, such as congestive heart failure, were most strongly associated with the probability of mortality. Our findings show that SVI can be a useful and accurate method for fitting Bayesian models, especially when the dataset size would make Monte Carlo methods prohibitively time-consuming or impossible.
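The abstract does not say how the Dirichlet Process Mixture component is represented. One common choice when a model is fitted by variational inference is a truncated stick-breaking construction, sketched below purely as an illustration. The function name `stick_breaking_weights`, the concentration parameter `alpha`, the truncation level of 50 and the 1% occupancy threshold are all assumptions made for this sketch and are not taken from the paper.

```python
import random


def stick_breaking_weights(alpha, truncation, seed=0):
    """Draw mixture weights from a truncated stick-breaking prior.

    alpha      -- concentration parameter (hypothetical value, not from the paper)
    truncation -- maximum number of clusters kept in the approximation
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any cluster
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick to break off
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last cluster takes whatever is left
    return weights


weights = stick_breaking_weights(alpha=1.0, truncation=50)
# Count clusters holding more than 1% of the prior mass (illustrative threshold).
occupied = sum(w > 0.01 for w in weights)
print(f"{occupied} of {len(weights)} clusters hold more than 1% of the mass")
```

Under a prior of this kind most of the weight usually falls on a small number of components, which is how a mixture with a generous truncation level can still settle on a modest number of occupied clusters.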