🤖 AI Summary
This work addresses multi-task binary classification in microbiome studies (e.g., cross-cohort health status prediction) with a method that models a sparsity structure shared across tasks. It combines a hierarchical sparse Bayesian prior with computationally efficient variational inference, enabling joint modeling of heterogeneous, multi-source microbiome data while quantifying predictive uncertainty and identifying informative microbial taxa. On synthetic benchmarks, the method shows superior support recovery when the tasks share a common sparsity structure. On microbiome data pooled from multiple studies, it achieves competitive predictive performance with well-calibrated predictions, and it remains robust to heterogeneity across studies, such as different experimental objectives, laboratory setups, sequencing equipment, and patient demographics.
📝 Abstract
This paper proposes a hierarchical Bayesian multi-task learning model for the general multi-task binary classification problem, under the assumption that the tasks share a common sparsity structure. We derive a computationally efficient algorithm based on variational inference to approximate the posterior distribution. We demonstrate the potential of the new approach on various synthetic datasets and on predicting human health status from microbiome profiles. Our analysis incorporates data pooled from multiple microbiome studies, along with a comprehensive comparison against benchmark methods. Results on synthetic datasets show that the proposed approach achieves superior support recovery when the underlying regression coefficients share a common sparsity structure across tasks. Our experiments on microbiome classification demonstrate the utility of the method in extracting informative taxa while providing well-calibrated predictions with uncertainty quantification and achieving competitive predictive performance. Notably, despite the heterogeneity of the pooled datasets (e.g., different experimental objectives, laboratory setups, sequencing equipment, and patient demographics), our method delivers robust results.
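To make the shared-sparsity setting concrete, the following is a minimal Python sketch of how such a synthetic benchmark might be set up and scored. It is not the paper's model or its variational inference algorithm: it only simulates several binary classification tasks whose coefficient vectors are nonzero on one common set of features, and computes a support recovery score for any estimated feature set. The function names (`make_shared_sparsity_tasks`, `support_recovery`), the logistic link, the Gaussian features, and all default sizes are assumptions chosen for illustration.

```python
import math
import random


def make_shared_sparsity_tasks(n_tasks=3, n_samples=200, n_features=20,
                               n_active=4, seed=0):
    """Simulate binary classification tasks that share one sparse support.

    Illustrative assumption: a logistic link with standard Gaussian features.
    """
    rng = random.Random(seed)
    # One support (set of active features) shared by every task.
    support = sorted(rng.sample(range(n_features), n_active))
    tasks = []
    for _ in range(n_tasks):
        # Task-specific coefficients: nonzero only on the shared support.
        beta = [0.0] * n_features
        for j in support:
            beta[j] = rng.choice([-1.0, 1.0]) * rng.uniform(1.0, 2.0)
        X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
             for _ in range(n_samples)]
        y = []
        for row in X:
            logit = sum(b * x for b, x in zip(beta, row))
            prob = 1.0 / (1.0 + math.exp(-logit))
            y.append(1 if rng.random() < prob else 0)
        tasks.append({"X": X, "y": y, "beta": beta})
    return support, tasks


def support_recovery(true_support, estimated_support):
    """Return (precision, recall, F1) of an estimated support."""
    true_set, est_set = set(true_support), set(estimated_support)
    hits = len(true_set & est_set)
    precision = hits / len(est_set) if est_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    support, tasks = make_shared_sparsity_tasks()
    print("shared support:", support)
    print("score of a perfect estimate:", support_recovery(support, support))
```

Any multi-task estimator could be plugged in between the two functions: fit it on `tasks`, take the features it selects as `estimated_support`, and compare against `support`. In this setup the magnitudes and signs of the coefficients differ by task while the set of active features does not, which is the situation in which the abstract reports the strongest support recovery.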