Bayesian nonparametric models for zero-inflated count-compositional data using ensembles of regression trees

📅 2026-01-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses challenges common in count-compositional data (overdispersion, zero inflation, cross-sample heterogeneity, and nonlinear covariate effects) by proposing Bayesian nonparametric models built on ensembles of regression trees. The approach builds on the zero-and-N-inflated multinomial distribution and assigns independent Bayesian additive regression tree (BART) priors to both the compositional probabilities and the structural-zero probabilities. Latent random effects are then added to capture overdispersion and more general dependence among categories. Posterior inference combines recent data augmentation schemes with established BART sampling routines. The models are evaluated in simulation studies and illustrated with two case studies, on microbiome and paleoclimate data.

📝 Abstract

Count-compositional data arise in many different fields, including high-throughput microbiome sequencing and palynology experiments, where a common, important goal is to understand how covariates relate to the observed compositions. Existing methods often fail to simultaneously address key challenges inherent in such data, namely: overdispersion, an excess of zeros, cross-sample heterogeneity, and nonlinear covariate effects. To address these concerns, we propose novel Bayesian models based on ensembles of regression trees. Specifically, we leverage the recently introduced zero-and-$N$-inflated multinomial distribution and assign independent nonparametric Bayesian additive regression tree (BART) priors to both the compositional and structural zero probability components of our model, to flexibly capture covariate effects. We further extend this by adding latent random effects to capture overdispersion and more general dependence structures among the categories. We develop an efficient inferential algorithm combining recent data augmentation schemes with established BART sampling routines. We evaluate our proposed models in simulation studies and illustrate their applicability with two case studies in microbiome and palaeoclimate modelling.
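To make the data-generating idea in the abstract concrete, the following is a minimal sketch of a simplified zero-inflated multinomial sampler. It is not the authors' model or code: the function name `sample_zi_multinomial`, the parameter names `theta` and `zeta`, and the handling of the case where every category is a structural zero are all assumptions made for illustration. In the paper the compositional and structural-zero probabilities depend on covariates through BART priors; here they are fixed numbers.

```python
import numpy as np


def sample_zi_multinomial(n_trials, theta, zeta, rng):
    """Draw one count vector from a simplified zero-inflated multinomial.

    theta : baseline composition (strictly positive, sums to 1)
    zeta  : per-category probability of being a structural zero
    """
    theta = np.asarray(theta, dtype=float)
    zeta = np.asarray(zeta, dtype=float)

    # present[j] is False when category j is drawn as a structural zero.
    present = rng.random(theta.shape[0]) >= zeta

    if not present.any():
        # Assumption for this sketch: if every category is a structural
        # zero, return an all-zero vector. The paper may treat this case
        # differently.
        return np.zeros(theta.shape[0], dtype=int)

    # Renormalise the composition over the categories that are present.
    probs = np.where(present, theta, 0.0)
    probs = probs / probs.sum()

    # If only one category is present, all n_trials counts land on it.
    return rng.multinomial(n_trials, probs)


# Hypothetical example: three categories, the third always a structural zero.
rng = np.random.default_rng(0)
theta = [0.5, 0.3, 0.2]
zeta = [0.0, 0.3, 1.0]
draws = np.array([sample_zi_multinomial(100, theta, zeta, rng) for _ in range(1000)])
```

In this example the first category is never a structural zero, so every row of `draws` sums to 100, while the third column is identically zero. Rows in which the second category is also absent place all 100 counts on the first category, which is the kind of excess mass at N that the distribution's name refers to.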

Problem

Research questions and friction points this paper is trying to address.

zero-inflated

count-compositional data

overdispersion

nonlinear covariate effects

cross-sample heterogeneity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian nonparametrics

zero-inflated compositional data

Bayesian additive regression trees