Bayesian nonparametric models for zero-inflated count-compositional data using ensembles of regression trees

📅 2026-01-12
🤖 AI Summary
This study addresses four common challenges in count-compositional data: overdispersion, zero inflation, cross-sample heterogeneity, and nonlinear covariate effects. It proposes Bayesian nonparametric models that assign independent Bayesian additive regression tree (BART) priors to both the compositional probabilities and the structural-zero probabilities of a zero-and-$N$-inflated multinomial distribution. Latent random effects are then added to capture overdispersion and more general dependence among the categories. Posterior inference combines recent data augmentation schemes with established BART sampling routines. The models are evaluated in simulation studies and illustrated with two case studies, in microbiome and palaeoclimate modelling.

📝 Abstract
Count-compositional data arise in many different fields, including high-throughput microbiome sequencing and palynology experiments, where a common, important goal is to understand how covariates relate to the observed compositions. Existing methods often fail to simultaneously address key challenges inherent in such data, namely: overdispersion, an excess of zeros, cross-sample heterogeneity, and nonlinear covariate effects. To address these concerns, we propose novel Bayesian models based on ensembles of regression trees. Specifically, we leverage the recently introduced zero-and-$N$-inflated multinomial distribution and assign independent nonparametric Bayesian additive regression tree (BART) priors to both the compositional and structural zero probability components of our model, to flexibly capture covariate effects. We further extend this by adding latent random effects to capture overdispersion and more general dependence structures among the categories. We develop an efficient inferential algorithm combining recent data augmentation schemes with established BART sampling routines. We evaluate our proposed models in simulation studies and illustrate their applicability with two case studies in microbiome and palaeoclimate modelling.
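
As a rough illustration of the data-generating structure described in the abstract, the sketch below simulates counts from a simplified zero-inflated multinomial. It is not the authors' specification: the function name `simulate_zi_multinomial`, the fixed nonlinear functions standing in for the BART sums of trees, the softmax and logistic links, and the handling of all-absent samples are assumptions made for this example only, and the latent random effects are omitted.

```python
import numpy as np


def simulate_zi_multinomial(x, n_trials, rng):
    """Simulate counts for 3 categories from a simplified zero-inflated multinomial.

    x        : 1-D array of a single covariate, one value per sample
    n_trials : total count N per sample
    rng      : numpy random Generator
    """
    n, d = x.shape[0], 3

    # Hypothetical nonlinear covariate effects (stand-ins for tree ensembles).
    f = np.column_stack([np.sin(2 * np.pi * x), x ** 2, np.zeros(n)])
    g = np.column_stack([-2.0 + 3.0 * (x > 0.5),
                         np.full(n, -1.5),
                         np.full(n, -3.0)])

    # Compositional probabilities via softmax (assumed link).
    ef = np.exp(f)
    theta = ef / ef.sum(axis=1, keepdims=True)

    # Structural-zero probabilities via the logistic function (assumed link).
    zeta = 1.0 / (1.0 + np.exp(-g))

    # A category is present in a sample with probability 1 - zeta.
    present = rng.random((n, d)) >= zeta

    counts = np.zeros((n, d), dtype=np.int64)
    for i in range(n):
        w = theta[i] * present[i]
        if w.sum() > 0:
            # Renormalise over present categories; if only one category is
            # present it receives all N counts (the "N-inflation" case).
            counts[i] = rng.multinomial(n_trials, w / w.sum())
        # If every category is a structural zero, the row stays all zeros.
    return counts, present


rng = np.random.default_rng(0)
x = rng.random(200)
counts, present = simulate_zi_multinomial(x, 50, rng)
print(counts[:5])
print("share of zero counts:", (counts == 0).mean())
```

In this sketch, zeros arise in two ways: structurally, when a category is switched off, and by sampling, when a present category happens to receive no counts. That distinction is what motivates modelling the two components with separate covariate-dependent functions.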
Problem

Research questions and friction points this paper is trying to address.

zero-inflated
count-compositional data
overdispersion
nonlinear covariate effects
cross-sample heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian nonparametrics
zero-inflated compositional data
Bayesian additive regression trees
overdispersion
multinomial inflation
André F. B. Menezes
Hamilton Institute and Department of Mathematics and Statistics, Maynooth University
Andrew C. Parnell
School of Mathematics and Statistics, Insight Centre for Data Analytics, University College Dublin, Ireland
Keefe Murphy
Maynooth University
Computational Statistics, Bayesian Statistics, Model-based Clustering, Statistical Machine Learning