SequenceLab: A Comprehensive Benchmark of Computational Methods for Comparing Genomic Sequences

📅 2023-10-25
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing methods for comparing genomic sequences lack systematic, cross-platform evaluation. Method: We introduce the first benchmark platform specifically designed for multi-platform sequencing data (Illumina, Oxford Nanopore Technologies, and PacBio), systematically evaluating 11 state-of-the-art aligners, including exact, heuristic, and learning-enhanced algorithms. Contribution/Results: Our end-to-end empirical analysis quantifies, for the first time, the high sensitivity of alignment performance to both sequencing data quality and hyperparameter configurations. We propose a standardized four-dimensional evaluation framework that assesses accuracy, speed, memory footprint, and noise robustness. Results reveal widespread deficiencies in robustness and resource efficiency across current tools. To support reproducibility and methodological advancement, we open-source a fully documented, end-to-end benchmarking pipeline on GitHub, providing an evidence-based foundation for algorithm selection, comparative analysis, and future alignment method development.
📝 Abstract
Computational complexity is a key limitation of genomic analyses. Thus, over the last 30 years, researchers have proposed numerous fast heuristic methods that provide computational relief. Comparing genomic sequences is one of the most fundamental computational steps in most genomic analyses. Due to its high computational complexity, optimized exact and heuristic algorithms are still being developed. We find that these methods are highly sensitive to the underlying data, its quality, and various hyperparameters. Despite their wide use, no in-depth analysis has been performed, potentially falsely discarding genetic sequences from further analysis and unnecessarily inflating computational costs. We provide the first analysis and benchmark of this heterogeneity. We deliver an actionable overview of the 11 most widely used state-of-the-art methods for comparing genomic sequences. We also inform readers about their advantages and downsides using thorough experimental evaluation and different real datasets from all major manufacturers (i.e., Illumina, ONT, and PacBio). SequenceLab is publicly available at https://github.com/CMU-SAFARI/SequenceLab.
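To make the exact-versus-heuristic trade-off in the abstract concrete, here is a minimal illustrative sketch in Python. It is not taken from SequenceLab and is not one of the 11 benchmarked tools. The function names `edit_distance` and `banded_edit_distance` and the `band` parameter are assumptions chosen for illustration. The first function computes the exact Levenshtein distance with a full dynamic-programming table. The second restricts the computation to a diagonal band, which is cheaper but gives up (returns `None`) when the true distance exceeds the band. This is one simple way a hyperparameter can decide whether a sequence pair is kept or discarded.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance; O(len(a) * len(b)) time, O(len(b)) memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # match or substitution
            )
        prev = curr
    return prev[-1]


def banded_edit_distance(a: str, b: str, band: int) -> Optional[int]:
    """Edit distance restricted to cells with |i - j| <= band.

    Returns the exact distance when it is <= band, otherwise None.
    `band` is a hypothetical hyperparameter used only for this sketch.
    """
    inf = float("inf")
    if abs(len(a) - len(b)) > band:
        return None  # the end cell lies outside the band
    prev = [j if j <= band else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [inf] * (len(b) + 1)
        lo, hi = max(0, i - band), min(len(b), i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                curr[0] = i
                continue
            curr[j] = min(
                prev[j] + 1,
                curr[j - 1] + 1,
                prev[j - 1] + (a[i - 1] != b[j - 1]),
            )
        prev = curr
    result = prev[-1]
    return int(result) if result <= band else None


if __name__ == "__main__":
    read, ref = "GATTACA", "GACTATA"
    print(edit_distance(read, ref))            # exact distance
    print(banded_edit_distance(read, ref, 2))  # band wide enough
    print(banded_edit_distance(read, ref, 1))  # band too narrow
```

In this sketch, a band that is too narrow rejects a pair that the exact method would report as two edits apart, which mirrors the abstract's concern about sequences being falsely discarded under certain parameter choices.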
Problem

Research questions and friction points this paper is trying to address.

Gene sequence analysis
Performance evaluation
Computational complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive Analysis
SequenceLab Benchmarking Tool
Real Experimental Data
Maximilian-David Rumpf
Department of Information Technology and Electrical Engineering, ETH Zürich, 8092 Zurich, Switzerland
Mohammed Alser
TT Assistant Professor, GSU, ALSER Lab
Bioinformatics, Metagenomics, Computational Genomics, Computer Architecture
Arvid E. Gollwitzer
MIT | ETH Zurich | Broad Institute of MIT and Harvard | CERN
Computational Genomics, Clinical Metagenomics, Cancer Detection, Targeted Drug Delivery
Joël Lindegger
Department of Information Technology and Electrical Engineering, ETH Zürich, 8092 Zurich, Switzerland
Nour Almadhoun
Department of Information Technology and Electrical Engineering, ETH Zürich, 8092 Zurich, Switzerland
Can Firtina
Assistant Professor of Computer Science, UMD
Bioinformatics, Computer Architecture, Hardware-Software Co-design
Serghei Mangul
USC
Genomics, Bioinformatics
Onur Mutlu
Department of Information Technology and Electrical Engineering, ETH Zürich, 8092 Zurich, Switzerland