KeBaB: $k$-mer based breaking for finding super-maximal exact matches

📅 2025-02-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper targets the cost of enumerating super-maximal exact matches (SMEMs) between noisy long reads and highly repetitive pangenome references. The key idea is to break each read into "pseudo-SMEMs": maximal substrings whose every $k$-mer a Bloom filter, built over the distinct $k$-mers of the reference $T$, reports as present, and to search only the pseudo-SMEMs of length at least the minimum SMEM length $L$. Because $T$ is highly repetitive the filter stays small, and because the reads are noisy the total length searched shrinks drastically. When only the $t$ longest SMEMs are wanted, a second heuristic sorts the pseudo-SMEMs into non-increasing order by length and stops searching as soon as $t$ SMEMs at least as long as the next candidate have been found. Both heuristics are admissible (a Bloom-filter false positive can only lengthen a pseudo-SMEM, never discard a true SMEM), and the authors' preliminary experiments indicate that they may significantly speed up SMEM-finding in practice, particularly on repetitive references under high sequencing noise.

📝 Abstract
Suppose we have a tool for finding super-maximal exact matches (SMEMs) and we want to use it to find all the long SMEMs between a noisy long read $P$ and a highly repetitive pangenomic reference $T$. Notice that if $L \geq k$ and the $k$-mer $P[i..i+k-1]$ does not occur in $T$ then no SMEM of length at least $L$ contains $P[i..i+k-1]$. Therefore, if we have a Bloom filter for the distinct $k$-mers in $T$ and we want to find only SMEMs of length $L \geq k$, then when given $P$ we can break it into maximal substrings consisting only of $k$-mers the filter says occur in $T$ -- which we call pseudo-SMEMs -- and search only the ones of length at least $L$. If $L$ is reasonably large and we can choose $k$ well then the Bloom filter should be small (because $T$ is highly repetitive) but the total length of the pseudo-SMEMs we search should also be small (because $P$ is noisy). Now suppose we are interested only in the longest $t$ SMEMs of length at least $L$ between $P$ and $T$. Notice that once we have found $t$ SMEMs of length at least $\ell$ then we need only search for SMEMs of length greater than $\ell$. Therefore, if we sort the pseudo-SMEMs into non-increasing order by length, then we can stop searching once we have found $t$ SMEMs at least as long as the next pseudo-SMEM we would search. Our preliminary experiments indicate that these two admissible heuristics may significantly speed up SMEM-finding in practice.
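The breaking step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `break_into_pseudo_smems` is hypothetical, and a plain Python set stands in for the Bloom filter (a real Bloom filter may report false positives, which can only lengthen pseudo-SMEMs, never lose a true SMEM).

```python
# Sketch of k-mer-based breaking: split read P into maximal substrings
# all of whose k-mers the filter reports as occurring in T, and keep
# only those of length at least L (the minimum SMEM length sought).
# A plain set stands in for the Bloom filter over T's distinct k-mers.

def break_into_pseudo_smems(P, kmers_of_T, k, L):
    pseudo = []
    start = None  # start of the current run of "present" k-mers
    for i in range(len(P) - k + 1):
        if P[i:i + k] in kmers_of_T:
            if start is None:
                start = i
        elif start is not None:
            # run of present k-mers at positions start..i-1 spans
            # P[start .. i-1+k-1], i.e. a substring of length (i-1+k)-start
            if (i - 1 + k) - start >= L:
                pseudo.append(P[start:i - 1 + k])
            start = None
    if start is not None and len(P) - start >= L:
        pseudo.append(P[start:])
    return pseudo
```

For example, with the reference $T$ = `ACGTACGTGG`, $k = 3$ and $L = 5$, the read `ACGTACGTTTT` yields the single pseudo-SMEM `ACGTACGT`, since the $k$-mers `GTT` and `TTT` are absent from $T$.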
Problem

Research questions and friction points this paper is trying to address.

Find super-maximal exact matches
Handle noisy long reads
Optimize repetitive pangenomic reference search
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses k-mer based breaking technique
Employs Bloom filter for efficiency
Sorts pseudo-SMEMs by length
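The second heuristic from the abstract, length-sorted search with early termination, can be sketched as below. The names `top_t_smems` and `find_smems` are hypothetical: `find_smems` stands in for whatever SMEM-finding tool is used, and here only needs to return the lengths of SMEMs of length at least a given threshold within a substring.

```python
# Sketch of the length-sorted early-termination heuristic: search
# pseudo-SMEMs in non-increasing order of length, and stop once we
# hold t SMEMs at least as long as the next candidate.

def top_t_smems(pseudo_smems, find_smems, t, L):
    cands = sorted(pseudo_smems, key=len, reverse=True)
    found = []  # lengths of SMEMs found so far, kept in decreasing order
    for cand in cands:
        # Any SMEM inside cand is at most len(cand) long, so once we
        # already hold t SMEMs at least that long, no later (shorter)
        # candidate can contribute and we may stop.
        if len(found) >= t and found[t - 1] >= len(cand):
            break
        # Once t SMEMs of length >= found[t-1] are in hand, only
        # strictly longer SMEMs matter, so raise the length threshold.
        thresh = max(L, found[t - 1] + 1) if len(found) >= t else L
        found.extend(find_smems(cand, thresh))
        found.sort(reverse=True)
    return found[:t]
```

With a toy `find_smems` that reports each candidate's own length when it meets the threshold, searching `["AAAAAAA", "AAAAA", "AAA"]` for the top $t = 2$ SMEMs of length at least $L = 3$ stops before the third candidate is ever examined.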
👥 Authors

Nathaniel K. Brown, PhD Student, Johns Hopkins University (computational genomics, data structures, data compression)
Anas Alhadi, Faculty of Computer Science, Dalhousie University, Canada
Nour Allam, Faculty of Computer Science, Dalhousie University, Canada
Dove Begleiter, Faculty of Computer Science, Dalhousie University, Canada
Nithin Bharathi Kabilan Karpagavalli, Faculty of Computer Science, Dalhousie University, Canada
Suchith Sridhar Khajjayam, Faculty of Computer Science, Dalhousie University, Canada
Hamza Wahed, Faculty of Computer Science, Dalhousie University, Canada
Travis Gagie, Associate Professor at Dalhousie University (data structures, data compression)