🤖 AI Summary
Species identification in nanopore metagenomic sequencing is bottlenecked by time-consuming sequence alignment. This paper proposes a real-time multi-task learning framework that classifies microbial species concurrently with basecalling. The method is a deep neural network with a shared encoder and task-specific backward propagation paths, coupled with a configurable top-K classification strategy that lets users trade accuracy against inference latency. In experiments, the model achieves state-of-the-art basecalling accuracy while attaining 92.5% top-1 and 98.9% top-3 species classification accuracy on the Wick bacterial dataset (17 genomes), matching or exceeding state-of-the-art binary classifiers. To the authors' knowledge, this is the first work to realize high-accuracy, low-latency, end-to-end species identification directly from raw nanopore electrical signals.
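The top-1 and top-3 figures above can be read as follows: a read counts as correct at top-K if its true genome is among the K highest-scoring classes. The sketch below is a minimal illustration of that metric only, not the paper's ranking strategy. The function names and the toy scores (four classes instead of 17) are made up for the example.

```python
from typing import Sequence


def top_k_labels(scores: Sequence[float], k: int) -> list[int]:
    """Return the indices of the k highest scores, best first."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of classes")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_accuracy(
    all_scores: Sequence[Sequence[float]],
    true_labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of reads whose true label is among the k best-scoring classes."""
    hits = sum(
        label in top_k_labels(scores, k)
        for scores, label in zip(all_scores, true_labels)
    )
    return hits / len(true_labels)


# Hypothetical per-read class scores (one row per read, one column per genome).
toy_scores = [
    [0.70, 0.20, 0.06, 0.04],  # true class 0 is ranked 1st
    [0.10, 0.50, 0.35, 0.05],  # true class 2 is ranked 2nd
    [0.40, 0.30, 0.20, 0.10],  # true class 3 is ranked 4th
    [0.05, 0.15, 0.20, 0.60],  # true class 3 is ranked 1st
]
toy_labels = [0, 2, 3, 3]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
```

Raising K can only keep or increase the hit rate, which is why the top-3 figure is higher than the top-1 figure. The cost is that the downstream step must consider K candidate genomes per read instead of one.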
📝 Abstract
The ability to quickly and accurately identify microbial species in a sample, known as metagenomic profiling, is critical across various fields, from healthcare to environmental science. This paper introduces a novel method that profiles signals coming from sequencing devices in parallel with determining their nucleotide sequences, a process known as basecalling. The method uses a multi-objective deep neural network for simultaneous basecalling and multi-class genome classification. We introduce a new loss strategy in which the losses for basecalling and classification are back-propagated separately, with model weights combined for the shared layers, and a pre-configured ranking strategy that reports the top-K candidate species, giving users the flexibility to choose between higher accuracy and higher speed when identifying species. We achieve state-of-the-art basecalling accuracies, while classification accuracies match or exceed the results of state-of-the-art binary classifiers, attaining an average of 92.5%/98.9% accuracy at identifying the top-1/3 species among a total of 17 genomes in the Wick bacterial dataset. The work presented here has implications for future studies in metagenomic profiling by accelerating the bottleneck step of matching the DNA sequence to the correct genome.
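The abstract does not spell out how the separately back-propagated losses are combined in the shared layers. The sketch below shows one possible reading, under loud assumptions: a toy scalar model with one shared weight and two head weights, squared-error losses as placeholders for the real basecalling and classification losses, and a simple average of the two shared-layer gradients. All names (`ToyModel`, `task_grads`, `train_step`) are hypothetical and none of this is taken from the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    w_shared: float  # stand-in for the shared encoder
    w_base: float    # stand-in for the basecalling head
    w_cls: float     # stand-in for the classification head


def task_grads(w_shared: float, w_head: float, x: float, target: float):
    """Backward pass for one task with loss = (w_head * w_shared * x - target)**2."""
    hidden = w_shared * x
    error = w_head * hidden - target
    loss = error ** 2
    grad_shared = 2.0 * error * w_head * x  # d(loss)/d(w_shared)
    grad_head = 2.0 * error * hidden        # d(loss)/d(w_head)
    return loss, grad_shared, grad_head


def train_step(model: ToyModel, x: float, base_target: float,
               cls_target: float, lr: float = 0.05):
    """One update: each loss is differentiated on its own, then combined."""
    base_loss, g_shared_base, g_base = task_grads(
        model.w_shared, model.w_base, x, base_target)
    cls_loss, g_shared_cls, g_cls = task_grads(
        model.w_shared, model.w_cls, x, cls_target)

    # Assumption: the shared layer receives the average of both task gradients.
    g_shared = 0.5 * (g_shared_base + g_shared_cls)

    model.w_shared -= lr * g_shared
    model.w_base -= lr * g_base  # each head only sees its own loss
    model.w_cls -= lr * g_cls
    return base_loss, cls_loss


model = ToyModel(w_shared=0.5, w_base=0.5, w_cls=0.5)
history = [
    train_step(model, x=1.0, base_target=1.0, cls_target=2.0)
    for _ in range(200)
]
```

Keeping the two backward passes separate makes the combination rule for the shared layers an explicit choice (here an unweighted average), rather than something implied by summing the losses before differentiation.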