Global-Local Dirichlet Processes for Identifying Pan-Cancer Subpopulations Using Both Shared and Cancer-Specific Data

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To jointly model globally shared and group-specific variables in grouped genomic data, this paper proposes the Global-Local (GLocal) Dirichlet process, a Bayesian nonparametric model. The GLocal Dirichlet process combines molecular profiles shared across cancers with cancer-specific clinical variables (carcinoembryonic antigen level, body mass index, and number of cigarettes smoked per day), capturing both the structure shared across groups and the finer heterogeneity within each group. The authors characterize the process through its stick-breaking representation and as the limit of a finite mixture model, which leads to an efficient posterior inference algorithm. Applied to a pan-gastrointestinal cancer dataset, the model uses the clinical variables to refine the gene-expression clusters into finer sub-clusters that cannot be identified without them. According to the authors, this refinement aids the understanding of tumor progression and heterogeneity.

📝 Abstract
We consider the problem of clustering grouped data for which the observations may include group-specific variables in addition to the variables that are shared across groups. This type of data is common in cancer genomics where the molecular information is usually accompanied by cancer-specific clinical information. Existing grouped clustering methods only consider the shared variables, thereby ignoring valuable information from the cancer-specific variables. To allow for these cancer-specific variables to aid in the clustering, we propose a novel Bayesian nonparametric approach, termed global-local (GLocal) Dirichlet process, that models the "global-local" structure of the observations across groups. We characterize the GLocal Dirichlet process using the stick-breaking representation and the representation as a limit of a finite mixture model, which leads to an efficient posterior inference algorithm. We illustrate our model with extensive simulations and a real pan-gastrointestinal cancer dataset. The cancer-specific clinical variables included carcinoembryonic antigen level, patients' body mass index, and the number of cigarettes smoked per day. These important clinical variables refine the clusters of gene expression data and allow us to identify finer sub-clusters, which is not possible in their absence. This refinement aids in the better understanding of tumor progression and heterogeneity. Moreover, our proposed method is applicable beyond the field of cancer genomics to a general grouped clustering framework in the presence of group-specific idiosyncratic variables.
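For orientation, the two representations named in the abstract can be written out for the standard Dirichlet process DP(α, H). This is only the textbook construction, not the GLocal-specific one, which the abstract does not spell out; the symbols α (concentration), H (base measure), v_k, π_k, and θ_k are notation assumed here.

```latex
% Stick-breaking representation of G ~ DP(alpha, H)
G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta_k}, \qquad
\pi_k = v_k \prod_{l<k} (1 - v_l), \qquad
v_k \overset{\text{iid}}{\sim} \mathrm{Beta}(1, \alpha), \qquad
\theta_k \overset{\text{iid}}{\sim} H

% Finite mixture approximation with K components; G_K approaches DP(alpha, H) as K -> infinity
G_K = \sum_{k=1}^{K} \pi_k \, \delta_{\theta_k}, \qquad
(\pi_1, \dots, \pi_K) \sim \mathrm{Dirichlet}\!\left(\tfrac{\alpha}{K}, \dots, \tfrac{\alpha}{K}\right), \qquad
\theta_k \overset{\text{iid}}{\sim} H
```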
Problem

Research questions and friction points this paper is trying to address.

Clustering grouped data with shared and group-specific variables
Identifying cancer subpopulations using genomic and clinical data
Developing Bayesian nonparametric method for pan-cancer analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

GLocal Dirichlet process models global-local structure
Stick-breaking representation enables efficient posterior inference
Incorporates cancer-specific variables to refine clustering results
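As a rough illustration of the stick-breaking idea listed above, the sketch below draws truncated stick-breaking weights for a standard Dirichlet process and uses them to assign cluster labels. It is not the GLocal construction from the paper; the function name `stick_breaking_weights`, the truncation level, the concentration value, and the Gaussian base measure are all assumptions made for this example.

```python
import numpy as np


def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking weights for a standard DP with concentration alpha."""
    # v_k ~ Beta(1, alpha): fraction broken off the remaining stick at step k
    v = rng.beta(1.0, alpha, size=truncation)
    # Close the stick at the truncation level so the weights sum to one
    v[-1] = 1.0
    # Stick length remaining before each break: prod_{l<k} (1 - v_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining


rng = np.random.default_rng(0)
truncation = 50  # hypothetical truncation level
weights = stick_breaking_weights(alpha=2.0, truncation=truncation, rng=rng)

# Hypothetical base measure H = N(0, 1) supplying one atom per component
atoms = rng.normal(0.0, 1.0, size=truncation)

# Cluster labels for 200 observations drawn from the truncated random measure
labels = rng.choice(truncation, size=200, p=weights)
print(len(np.unique(labels)), "occupied clusters out of", truncation)
```

Smaller values of `alpha` concentrate the mass on the first few components, so fewer clusters end up occupied; larger values spread it more evenly.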
Arhit Chakrabarti
Department of Statistics, Texas A&M University
Yang Ni
Department of Statistics and Data Sciences, University of Texas at Austin
Debdeep Pati
Professor, Department of Statistics, University of Wisconsin - Madison
Bayesian nonparametrics, high-dimensional data analysis
Bani K. Mallick
Department of Statistics, Texas A&M University