🤖 AI Summary
Microbiome data are high-dimensional, sparse, and biologically variable, and they usually come with small sample sizes, all of which make accurate classification difficult. To address this, we propose TaxaPLN, a data augmentation framework that jointly leverages the taxonomic hierarchy and covariate information. The method embeds the microbial taxonomy tree into a Poisson Lognormal (PLN) generative model and uses Feature-wise Linear Modulation (FiLM) for covariate-conditioned synthesis, so that generated samples are both ecologically plausible and phenotypically relevant. A data-driven sampling strategy further improves generation fidelity, which in turn improves the generalization of downstream nonlinear classifiers. Experiments across diverse trait prediction tasks demonstrate that TaxaPLN consistently outperforms state-of-the-art baselines, with particularly pronounced accuracy gains in low-sample regimes, e.g., up to +12.7% absolute improvement in classification accuracy under extreme scarcity (n < 30 per class).
📝 Abstract
The gut microbiome plays a crucial role in human health, making it a cornerstone of modern biomedical research. To study its structure and dynamics, machine learning models are increasingly used to identify key microbial patterns associated with disease and environmental factors. However, microbiome data present unique challenges due to their compositionality, high dimensionality, sparsity, and high variability, which can obscure meaningful signals. Moreover, the effectiveness of machine learning models is often constrained by limited sample sizes, as microbiome data collection remains costly and time-consuming. In this context, data augmentation has emerged as a promising strategy to enhance model robustness and predictive performance by generating artificial microbiome data. The aim of this study is to improve predictive modeling from microbiome data by introducing a model-based data augmentation approach that incorporates both taxonomic relationships and covariate information. To that end, we propose TaxaPLN, a data augmentation method built on PLN-Tree generative models, which leverages the taxonomy and a data-driven sampler to generate realistic synthetic microbiome compositions. We further introduce a conditional extension based on feature-wise linear modulation, enabling covariate-aware generation. Experiments on high-quality curated microbiome datasets show that TaxaPLN preserves ecological properties and generally improves or maintains predictive performance, particularly with non-linear classifiers, outperforming state-of-the-art baselines. In addition, TaxaPLN conditional augmentation establishes a novel benchmark for covariate-aware microbiome augmentation. The MIT-licensed source code is available at https://github.com/AlexandreChaussard/PLNTree-package along with the datasets used in our experiments.
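
To make the two building blocks named above more concrete, here is a minimal NumPy sketch of (a) sampling counts from a flat Poisson log-normal model and (b) modulating the latent mean with feature-wise linear modulation. It is an illustration under stated assumptions, not the TaxaPLN implementation and not the API of the PLNTree package: the function names `film` and `sample_pln`, the random weight matrices, the toy dimensions, and the choice to apply the modulation to the latent mean are all hypothetical, and the taxonomy tree and the data-driven sampler are left out entirely.

```python
import numpy as np

def film(h, covariates, W_gamma, W_beta):
    """Feature-wise linear modulation: scale and shift h per feature.

    gamma and beta are linear functions of the covariates here; in a real
    model they would usually come from a learned network (assumption).
    """
    gamma = covariates @ W_gamma  # per-feature scale
    beta = covariates @ W_beta    # per-feature shift
    return gamma * h + beta

def sample_pln(mu, sigma, rng, n):
    """Draw n count vectors from a Poisson log-normal model:
    Z ~ N(mu, sigma), then X | Z ~ Poisson(exp(Z)) independently per taxon.
    """
    z = rng.multivariate_normal(mu, sigma, size=n)
    return rng.poisson(np.exp(z))

rng = np.random.default_rng(0)
n_taxa, n_cov = 5, 2  # toy sizes, chosen only for illustration

# Hypothetical, untrained parameters standing in for fitted ones.
base_mu = np.ones(n_taxa)
sigma = 0.5 * np.eye(n_taxa)
W_gamma = rng.normal(size=(n_cov, n_taxa))
W_beta = rng.normal(size=(n_cov, n_taxa))

# One-hot covariate, e.g. a binary phenotype label.
covariates = np.array([1.0, 0.0])

# Covariate-conditioned latent mean, then synthetic counts.
mu_cond = film(base_mu, covariates, W_gamma, W_beta)
counts = sample_pln(mu_cond, sigma, rng, n=10)
```

In an augmentation setting, rows of `counts` would be appended to the training set with the label encoded in `covariates`; how TaxaPLN actually selects and filters generated samples is not described in the abstract and is not modeled here.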