🤖 AI Summary
To address the low accuracy and inefficiency of existing methods for genus- and species-level classification of DNA barcodes, this study introduces the first self-supervised Transformer model family specifically designed for DNA barcodes. Leveraging 1.5 million invertebrate COI sequences, we incorporate domain-specific biological priors to devise a tailored masking strategy and tokenization scheme, enabling efficient pretraining. Compared with fine-tuned general-purpose DNA foundation models and conventional machine learning methods, our model achieves BLAST-level accuracy on species-level classification while accelerating inference by 55×. Moreover, it significantly outperforms supervised neural networks and existing foundation models on genus- and species-level identification tasks. This work establishes a scalable, high-throughput, and high-accuracy paradigm for biodiversity monitoring, bridging critical gaps between deep learning and molecular taxonomy.
📝 Abstract
In the global challenge of understanding and characterizing biodiversity, short species-specific genomic sequences known as DNA barcodes play a critical role, enabling fine-grained comparisons among organisms within the same kingdom of life. Although machine learning algorithms specifically designed for the analysis of DNA barcodes are becoming more popular, most existing methodologies rely on generic supervised training algorithms. We introduce BarcodeBERT, a family of models tailored to biodiversity analysis and trained exclusively on data from a reference library of 1.5M invertebrate DNA barcodes. We compared the performance of BarcodeBERT on taxonomic identification tasks against a spectrum of machine learning approaches, including supervised training of classical neural architectures and fine-tuning of general-purpose DNA foundation models. Models pretrained with our self-supervised strategies on domain-specific data outperform fine-tuned foundation models, especially in identification tasks involving lower taxa such as genera and species. We also compared BarcodeBERT with BLAST, one of the most widely used bioinformatics tools for sequence searching, and found that our method matched BLAST's performance in species-level classification while being 55 times faster. Our analysis of masking and tokenization strategies also provides practical guidance for building customized DNA language models, emphasizing the importance of aligning model training strategies with dataset characteristics and domain knowledge. The code repository is available at https://github.com/bioscan-ml/BarcodeBERT.
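The abstract refers to k-mer tokenization and masking strategies without detailing them; the paper's exact scheme is not reproduced here. As a rough, hypothetical illustration of the general idea, the sketch below splits a DNA barcode into non-overlapping k-mers and applies BERT-style random masking (the value k=4, the 15% masking rate, and the `[MASK]` token are illustrative assumptions, not the paper's specification):

```python
import random

def kmer_tokenize(seq, k=4):
    """Split a DNA sequence into non-overlapping k-mers,
    dropping any trailing bases that do not fill a full k-mer.
    (k=4 is an illustrative choice, not the paper's setting.)"""
    return [seq[i:i + k] for i in range(0, len(seq) - len(seq) % k, k)]

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """BERT-style masking sketch: randomly replace a fraction of k-mer
    tokens with a mask token; the model is trained to predict the
    originals. Parameters here are assumptions for illustration."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)    # original k-mer to be predicted
        else:
            masked.append(tok)
            labels.append(None)   # position not scored in the MLM loss
    return masked, labels

barcode = "AACATTATATTTTATTTTTGGAATTTGAGC"  # toy COI-like fragment
tokens = kmer_tokenize(barcode)
masked, labels = mask_tokens(tokens)
```

Each masked position contributes one prediction target, so the masked-language-modeling objective reduces to classifying which of the 4^k possible k-mers was hidden at each `[MASK]` slot.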