🤖 AI Summary
Fungal DNA barcode classification faces three key challenges: label sparsity, a long-tailed class distribution, and the difficulty of modeling hierarchical taxonomic structure. Together, these lead to poor generalization and hierarchical inconsistency in conventional supervised methods. To address them, we propose the first domain-specific state space model (SSM) for fungal barcoding, trained under a pretrain and fine-tune paradigm. Our method introduces hierarchical label smoothing, a class-weighted cross-entropy loss, and a MycoAI-inspired multi-head hierarchical classifier that explicitly enforces phylogenetic constraints across six taxonomic levels: phylum, class, order, family, genus, and species. Evaluated on a fungal classification benchmark with distributional shift, our approach achieves state-of-the-art accuracy at all taxonomic levels and significantly improves zero-shot generalization and hierarchical consistency. The implementation is publicly available.
📝 Abstract
Accurate taxonomic classification from DNA barcodes is a cornerstone of global biodiversity monitoring, yet fungi present extreme challenges due to sparse labelling and long-tailed taxa distributions. Conventional supervised learning methods often falter in this domain, struggling to generalize to unseen species and to capture the hierarchical nature of the data. To address these limitations, we introduce BarcodeMamba+, a foundation model for fungal barcode classification built on an efficient state-space model architecture. We employ a pretrain and fine-tune paradigm that makes use of partially labelled data, and we demonstrate that it is substantially more effective than traditional fully-supervised methods in this data-sparse environment. During fine-tuning, we systematically integrate and evaluate a suite of enhancements (hierarchical label smoothing, a weighted loss function, and a multi-head output layer from MycoAI) that specifically target the challenges of fungal taxonomy. Our experiments show that each of these components yields significant performance gains. On a challenging fungal classification benchmark whose taxonomic distribution is shifted relative to the broad training set, our final model outperforms a range of existing methods across all taxonomic levels. Our work provides a powerful new tool for genomics-based biodiversity research and establishes an effective and scalable training paradigm for this challenging domain. Our code is publicly available at https://github.com/bioscan-ml/BarcodeMamba.
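
To make two of the fine-tuning enhancements named in the abstract more concrete, the following is a minimal sketch of hierarchical label smoothing combined with a class-weighted loss. It is not taken from the BarcodeMamba+ code. It assumes one common formulation, in which the smoothing mass `eps` is shared among classes with the same parent taxon, and inverse-frequency class weights; the paper's exact scheme may differ. All function names, the `parent_of` mapping, and the toy values are hypothetical.

```python
import math
from collections import Counter


def hierarchical_smooth_target(true_idx, parent_of, eps=0.1):
    """Soft target: 1 - eps on the true class, eps shared among its siblings.

    parent_of[i] is the (hypothetical) parent-taxon id of class i.
    If the true class has no siblings, the target stays one-hot.
    """
    n = len(parent_of)
    siblings = [i for i in range(n)
                if i != true_idx and parent_of[i] == parent_of[true_idx]]
    target = [0.0] * n
    if not siblings:
        target[true_idx] = 1.0
        return target
    target[true_idx] = 1.0 - eps
    for i in siblings:
        target[i] = eps / len(siblings)
    return target


def inverse_frequency_weights(labels, n_classes):
    """Weights proportional to 1 / class count, scaled to mean 1 over seen classes.

    Assumes labels is non-empty; unseen classes get weight 0.
    """
    counts = Counter(labels)
    raw = [1.0 / counts[c] if counts[c] else 0.0 for c in range(n_classes)]
    seen = sum(1 for w in raw if w > 0)
    scale = seen / sum(raw)
    return [w * scale for w in raw]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def weighted_soft_cross_entropy(logits, target, class_weight):
    """Cross-entropy against a soft target, scaled by the true class's weight."""
    logp = log_softmax(logits)
    return -class_weight * sum(t * lp for t, lp in zip(target, logp))


# Toy example: species 0-2 share genus 0; species 3 belongs to genus 1.
parent_of = [0, 0, 0, 1]
target = hierarchical_smooth_target(0, parent_of, eps=0.1)
weights = inverse_frequency_weights([0, 0, 0, 1, 2, 3], n_classes=4)
loss = weighted_soft_cross_entropy([2.0, 0.5, 0.1, -1.0], target, weights[0])
```

In the toy example, species 0 keeps 0.9 of the target mass and its two same-genus siblings receive 0.05 each, while the species in the other genus receives none. The over-represented species 0 is down-weighted relative to the three rarer species. In a multi-head setup, a loss of this kind would presumably be computed per taxonomic level and summed, but that detail is not specified in the abstract.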