🤖 AI Summary
To jointly model globally shared and group-specific variables in grouped genomic data, this paper proposes the Global-Local (GLocal) Dirichlet process, a Bayesian nonparametric model. The GLocal Dirichlet process combines molecular profiles shared across cancers with cancer-specific clinical variables (e.g., carcinoembryonic antigen level, body mass index, number of cigarettes smoked per day), capturing both the structure shared across groups and finer within-group heterogeneity. The model is characterized through its stick-breaking representation and as the limit of a finite mixture model, which yields an efficient posterior inference algorithm. Applied to pan-gastrointestinal cancer data, the clinical variables refine the gene expression clusters into finer sub-clusters that cannot be identified without them. These refined sub-clusters aid the understanding of tumor progression and heterogeneity.
📝 Abstract
We consider the problem of clustering grouped data for which the observations may include group-specific variables in addition to the variables that are shared across groups. This type of data is common in cancer genomics, where the molecular information is usually accompanied by cancer-specific clinical information. Existing grouped clustering methods consider only the shared variables, thereby ignoring valuable information from the cancer-specific variables. To allow these cancer-specific variables to aid in the clustering, we propose a novel Bayesian nonparametric approach, termed the global-local (GLocal) Dirichlet process, that models the "global-local" structure of the observations across groups. We characterize the GLocal Dirichlet process using the stick-breaking representation and the representation as a limit of a finite mixture model, which leads to an efficient posterior inference algorithm. We illustrate our model with extensive simulations and a real pan-gastrointestinal cancer dataset. The cancer-specific clinical variables included carcinoembryonic antigen level, patients' body mass index, and the number of cigarettes smoked per day. These important clinical variables refine the clusters of gene expression data and allow us to identify finer sub-clusters, which would not be possible in their absence. This refinement aids in better understanding tumor progression and heterogeneity. Moreover, our proposed method is applicable beyond the field of cancer genomics to a general grouped clustering framework in the presence of group-specific idiosyncratic variables.
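For readers unfamiliar with the two representations named in the abstract, the sketch below illustrates them for a standard (single-group) Dirichlet process only. It is not the GLocal construction, which the abstract does not spell out. The function names, the concentration value `alpha=2.0`, and the truncation level of 25 are illustrative assumptions.

```python
import numpy as np

def stick_breaking_weights(alpha, num_sticks, rng):
    """Truncated stick-breaking weights for a standard Dirichlet process.

    Each beta_k ~ Beta(1, alpha) is the fraction broken off the stick
    that remains after the first k-1 breaks.
    """
    betas = rng.beta(1.0, alpha, size=num_sticks)
    betas[-1] = 1.0  # assign all leftover mass to the last atom (truncation)
    # Length of stick remaining before each break: prod_{j<k} (1 - beta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

def finite_mixture_weights(alpha, num_components, rng):
    """Weights of a finite mixture with a symmetric Dirichlet(alpha/K) prior.

    As K grows, this finite mixture approaches a Dirichlet process mixture.
    """
    concentration = np.full(num_components, alpha / num_components)
    return rng.dirichlet(concentration)

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
sb_weights = stick_breaking_weights(alpha=2.0, num_sticks=25, rng=rng)
fm_weights = finite_mixture_weights(alpha=2.0, num_components=25, rng=rng)
```

In both cases a few components typically carry most of the probability mass, which is why such priors can infer the number of clusters from the data rather than fixing it in advance.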