🤖 AI Summary
Problem: Phenotyping heterogeneous asthma from electronic health record (EHR) data is hampered by ambiguous phenotype definitions, potentially non-ignorable missingness, and limited interpretability of unsupervised clusters. Method: We propose a prior-knowledge-guided Bayesian latent class model that encodes clinical knowledge as informative priors and models the missing-data mechanism alongside the phenotype structure. The approach combines unsupervised clustering, missingness modeling, and patient-level probabilistic assignment with uncertainty, and is implemented in reusable, flexible code. Results: Applied to a cohort of 44,642 adult asthma patients, the model identified an “uncontrolled T2-high” subgroup comprising 38.7% of the cohort, characterized by elevated eosinophil levels, allergy markers, and high healthcare utilization and medication use. The framework combines domain knowledge with data-driven learning in a single Bayesian model and supports hypothesis generation and cohort identification for heterogeneous diseases that lack well-established phenotype definitions.
📝 Abstract
Objectives: Unsupervised learning with electronic health record (EHR) data has shown promise for phenotype discovery, but approaches typically disregard existing clinical information, limiting interpretability. We operationalize a Bayesian latent class framework for phenotyping that incorporates domain-specific knowledge to improve clinical meaningfulness of EHR-derived phenotypes and illustrate its utility by identifying an asthma sub-phenotype informed by features of Type 2 (T2) inflammation. Materials and Methods: We illustrate a framework for incorporating clinical knowledge into a Bayesian latent class model via informative priors to guide unsupervised clustering toward clinically relevant subgroups. This approach models missingness, accounting for potential missing-not-at-random patterns, and provides patient-level probabilities for phenotype assignment with uncertainty. Using reusable and flexible code, we applied the model to a large asthma EHR cohort, specifying informative priors for T2 inflammation-related features and weakly informative priors for other clinical variables, allowing the data to inform posterior distributions. Results and Conclusion: Using encounter data from January 2017 to February 2024 for 44,642 adult asthma patients, we found a bimodal posterior distribution of phenotype assignment, indicating clear class separation. The T2 inflammation-informed class (38.7%) was characterized by elevated eosinophil levels and allergy markers, plus high healthcare utilization and medication use, despite weakly informative priors on the latter variables. These patterns suggest an "uncontrolled T2-high" sub-phenotype. This demonstrates how our Bayesian latent class modeling approach supports hypothesis generation and cohort identification in EHR-based studies of heterogeneous diseases without well-established phenotype definitions.
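The mechanics described in the abstract (informative priors on T2-related features, weakly informative priors elsewhere, class-dependent missingness, and patient-level assignment probabilities) can be illustrated with a small sketch. It is not the authors' code. It assumes binary features, two classes, simulated data, and Beta priors chosen purely for illustration, and it uses MAP estimation via EM in place of full posterior sampling, so it yields point estimates without the uncertainty a full Bayesian fit would provide. The function name `fit_lca_map`, the feature names, and all numeric settings are hypothetical.

```python
import numpy as np

def fit_lca_map(X, a, b, n_iter=200, alpha=2.0, miss_a=2.0, miss_b=2.0):
    """MAP-EM for a latent class model over binary features with missing values.

    X    : (n, p) array of 0/1 values, np.nan where a feature is missing.
    a, b : (K, p) Beta prior parameters for P(feature = 1 | class), all >= 1.
    Returns class weights pi (K,), feature probabilities theta (K, p),
    observation probabilities rho (K, p), and responsibilities resp (n, K).
    """
    R = (~np.isnan(X)).astype(float)   # 1 where the feature was recorded
    X1 = np.where(R == 1, X, 0.0)      # recorded and positive
    X0 = R - X1                        # recorded and negative
    n = X.shape[0]
    K, p = a.shape
    pi = np.full(K, 1.0 / K)
    theta = a / (a + b)                # start at prior means (breaks label symmetry)
    rho = np.full((K, p), 0.5)
    for _ in range(n_iter):
        # E-step: class log-posterior from feature values AND missingness pattern
        log_post = (np.log(pi)
                    + X1 @ np.log(theta).T + X0 @ np.log1p(-theta).T
                    + R @ np.log(rho).T + (1.0 - R) @ np.log1p(-rho).T)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: posterior modes under Dirichlet(alpha) and Beta priors
        n_c = resp.sum(axis=0)
        n_obs = resp.T @ R
        pi = (n_c + alpha - 1.0) / (n + K * (alpha - 1.0))
        theta = (resp.T @ X1 + a - 1.0) / (n_obs + a + b - 2.0)
        rho = (n_obs + miss_a - 1.0) / (n_c[:, None] + miss_a + miss_b - 2.0)
    return pi, theta, rho, resp

# Simulated cohort (all values are illustrative assumptions, not study data).
features = ["eos_high", "ige_high", "allergic_rhinitis", "high_utilization", "controller_med"]
rng = np.random.default_rng(0)
n = 5000
is_t2 = rng.random(n) < 0.4                       # class 0 = "T2-high"
p_high = np.array([0.85, 0.75, 0.70, 0.65, 0.75])
p_low = np.array([0.15, 0.15, 0.25, 0.15, 0.25])
X = (rng.random((n, 5)) < np.where(is_t2[:, None], p_high, p_low)).astype(float)
# Eosinophils are recorded more often in the T2-high class (informative missingness).
eos_recorded = rng.random(n) < np.where(is_t2, 0.9, 0.5)
X[~eos_recorded, 0] = np.nan

# Informative priors on the three T2-related features, Beta(2, 2) on the rest.
a = np.array([[8.0, 6.0, 6.0, 2.0, 2.0],
              [2.0, 2.0, 2.0, 2.0, 2.0]])
b = np.array([[2.0, 2.0, 2.0, 2.0, 2.0],
              [8.0, 6.0, 6.0, 2.0, 2.0]])

pi, theta, rho, resp = fit_lca_map(X, a, b)
print("estimated class weights:", np.round(pi, 3))
for j, name in enumerate(features):
    print(f"{name}: P(1 | class 0) = {theta[0, j]:.2f}, P(1 | class 1) = {theta[1, j]:.2f}")
print("share of patients with assignment probability below 0.1 or above 0.9:",
      np.round(np.mean((resp[:, 0] < 0.1) | (resp[:, 0] > 0.9)), 3))
```

In this sketch the informative priors only pin down which class is labeled "T2-high"; the utilization and medication features receive the same Beta(2, 2) prior in both classes, so any separation on them comes from the data. The rows of `resp` play the role of patient-level assignment probabilities, and a histogram of `resp[:, 0]` is the kind of plot in which a bimodal pattern would show up. A full Bayesian fit (for example by MCMC) would be needed to attach uncertainty to those probabilities.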