sanba: An R Package for Bayesian Clustering of Distributions via Shared Atoms Nested Models

📅 2025-08-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Jointly clustering individual observations and groups in grouped data with Bayesian nested mixture models has often been limited in practice by computational complexity. Method: The paper presents sanba, an R package that fits nested mixture models with a shared set of atoms, a structure recently introduced in the statistical literature, to cluster both observations and groups while flexibly estimating group-specific densities. Inference is available through Markov chain Monte Carlo routines and through variational inference algorithms tailored to large-scale datasets; all core functions are implemented in C++ and integrated into R. Contribution/Results: The package makes shared-atom nested mixture models fast and user-friendly to fit from R with modern Bayesian algorithms.

📝 Abstract

Nested data structures arise when observations are grouped into distinct units, such as patients within hospitals or students within schools. Accounting for this hierarchical organization is essential for valid inference, as ignoring it can lead to biased estimates and poor generalization. This article addresses the challenge of clustering both individual observations and their corresponding groups while flexibly estimating group-specific densities. Bayesian nested mixture models offer a principled and robust framework for this task. However, their practical use has often been limited by computational complexity. To overcome this barrier, we present sanba, an R package for Bayesian analysis of grouped data using nested mixture models with a shared set of atoms, a structure recently introduced in the statistical literature. The package provides multiple inference strategies, including state-of-the-art Markov Chain Monte Carlo routines and variational inference algorithms tailored for large-scale datasets. All core functions are implemented in C++ and seamlessly integrated into R, making sanba a fast and user-friendly tool for fitting nested mixture models with modern Bayesian algorithms.
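To make the idea of a "shared set of atoms" concrete, the following is a minimal simulation sketch in Python. It is not the sanba API and not the authors' code. It assumes a finite Gaussian version of the model with fixed, hand-picked values: every distributional cluster mixes over the same atoms (component means) with its own weights, each group is assigned to one distributional cluster, and each observation is assigned to one atom. The function name `simulate_shared_atoms` and all numeric values are hypothetical.

```python
import random


def simulate_shared_atoms(n_groups=6, n_per_group=50, seed=1):
    """Simulate grouped data from a toy shared-atoms nested mixture (illustrative only)."""
    rng = random.Random(seed)

    # Shared atoms: one common set of component means used by every group.
    atoms = [-4.0, 0.0, 4.0]

    # Distributional clusters: each has its own weights over the SAME atoms.
    dist_weights = [
        [0.7, 0.3, 0.0],  # cluster 0 never uses the third atom
        [0.0, 0.4, 0.6],  # cluster 1 never uses the first atom
    ]
    dist_probs = [0.5, 0.5]  # probability that a group joins each distributional cluster

    groups = []
    for _ in range(n_groups):
        # Group-level clustering: pick the distribution this group follows.
        k = rng.choices(range(len(dist_weights)), weights=dist_probs)[0]
        # Observation-level clustering: pick an atom for each observation.
        labels = rng.choices(range(len(atoms)), weights=dist_weights[k], k=n_per_group)
        # Draw each observation around its atom with unit standard deviation.
        y = [rng.gauss(atoms[l], 1.0) for l in labels]
        groups.append({"dist_cluster": k, "obs_clusters": labels, "y": y})
    return atoms, groups


if __name__ == "__main__":
    atoms, groups = simulate_shared_atoms()
    for j, g in enumerate(groups):
        print(j, g["dist_cluster"], sorted(set(g["obs_clusters"])))
```

Because the atoms are shared, an observation-level cluster label means the same thing in every group, so observational clusters can be compared across groups even when the groups follow different distributions. Fitting such a model means inferring the atoms, weights, and both sets of labels from the observations alone.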

Problem

Research questions and friction points this paper is trying to address.

Clustering individual observations and their groups simultaneously

Flexibly estimating group-specific densities

Overcoming the computational complexity of Bayesian nested mixture models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian nested mixture models for clustering

Shared atoms structure for grouped data

C++-accelerated MCMC and variational inference
