Nested Atoms Model with Application to Clustering Big Population-Scale Single-Cell Data

📅 2026-04-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the challenge of jointly modeling group-level variables (e.g., genotype) and observation-level variables (e.g., gene expression) in nested data, such as individuals and their constituent cells. To this end, the authors propose the Nested Atoms Model (NAM), a Bayesian nonparametric approach that incorporates both variable types in a single two-layered clustering framework, so that individuals and cells are clustered simultaneously. A fast variational Bayesian inference algorithm scales NAM to high-dimensional single-cell RNA-sequencing and genotype data. Applied to the OneK1K dataset, NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles, and the resulting cell clusters align with known immune cell types based on differential gene expression.

📝 Abstract
We consider the problem of clustering nested or hierarchical data, where observations are grouped and there are both group-level and observation-level variables. In our motivating OneK1K dataset, observations consist of single-cell RNA-sequencing (scRNA-seq) data from 982 individuals (groups), totaling 1.27 million cells (observations), along with individual-specific genotype data. This type of data would enable the identification of cell types and the investigation of how genetic variations among individuals influence differences in cell-type profiles. Our goal, therefore, is to jointly cluster cells and individuals to capture the heterogeneity across both levels using cell-specific gene expressions as well as individual-specific genotypes. However, existing grouped clustering methods do not incorporate group-level variables, thereby limiting their ability to capture the heterogeneity of genotypes in our motivating application. To address this, we propose the Nested Atoms Model (NAM), a new Bayesian nonparametric approach that enables the desired two-layered clustering, accounting for both group-level and observation-level variables. To scale NAM for high-dimensional data, we develop a fast variational Bayesian inference algorithm. Simulations show that NAM outperforms existing methods that ignore group-level variables. Applied to the OneK1K dataset, NAM identifies clusters of genetically similar individuals with homogeneous cell-type profiles. The resulting cell clusters align with known immune cell types based on differential gene expression, underscoring the ability of NAM to capture nested heterogeneity and provide biologically meaningful insights.
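To make the nested structure described in the abstract concrete, the following is a minimal toy simulation, not the authors' model or code. It assumes a finite number of clusters at both levels (NAM itself is Bayesian nonparametric), binomial genotype counts, and Gaussian gene expression; the function name `simulate_nested_data` and all parameters are hypothetical. Each individual belongs to a group-level cluster that determines both its genotype profile and its mixing weights over a set of cell-level cluster means shared by all individuals.

```python
import numpy as np

def simulate_nested_data(n_groups=6, cells_per_group=50, n_group_clusters=2,
                         n_cell_clusters=3, n_snps=4, n_genes=5, seed=0):
    """Toy nested data: individuals (groups) containing cells (observations).

    Assumptions for illustration only: finite cluster counts, binomial
    genotypes in {0, 1, 2}, and unit-variance Gaussian expression.
    """
    rng = np.random.default_rng(seed)

    # Cell-level cluster means ("atoms"), shared across all individuals.
    cell_atoms = rng.normal(0.0, 3.0, size=(n_cell_clusters, n_genes))

    # Each group-level cluster has its own allele frequencies and its own
    # weights over the shared cell-level clusters.
    allele_freqs = rng.uniform(0.1, 0.9, size=(n_group_clusters, n_snps))
    cell_weights = rng.dirichlet(np.ones(n_cell_clusters), size=n_group_clusters)

    # Group-level variables: cluster label and genotype for each individual.
    group_labels = rng.integers(0, n_group_clusters, size=n_groups)
    genotypes = rng.binomial(2, allele_freqs[group_labels])

    # Observation-level variables: cluster label and expression for each cell.
    cell_labels, expression = [], []
    for g in range(n_groups):
        weights = cell_weights[group_labels[g]]
        z = rng.choice(n_cell_clusters, size=cells_per_group, p=weights)
        x = cell_atoms[z] + rng.normal(size=(cells_per_group, n_genes))
        cell_labels.append(z)
        expression.append(x)

    return {
        "group_labels": group_labels,            # (n_groups,)
        "genotypes": genotypes,                  # (n_groups, n_snps)
        "cell_labels": np.stack(cell_labels),    # (n_groups, cells_per_group)
        "expression": np.stack(expression),      # (n_groups, cells_per_group, n_genes)
    }

data = simulate_nested_data()
print(data["genotypes"].shape, data["expression"].shape)
```

In this sketch, a method that clusters individuals using only the cell-level data would ignore the `genotypes` array entirely, which is the limitation of existing grouped clustering methods that the abstract points to.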
Problem

Research questions and friction points this paper is trying to address.

nested clustering
single-cell RNA-seq
hierarchical data
group-level variables
population-scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

Nested Atoms Model
Bayesian nonparametrics
two-layered clustering
single-cell RNA-seq
variational inference
🔎 Similar Papers
2024-04-09, International Conference on Database Systems for Advanced Applications, Citations: 7