🤖 AI Summary
This paper addresses the efficient merging of run-length compressed Burrows–Wheeler transforms (RLBWTs). We propose a new algorithm with space complexity $O(R)$ and time complexity $\tilde{O}(L + \sigma + R)$, where $R$ is the total number of runs in the input RLBWTs, $\sigma$ is the alphabet size, and $L$ is the sum of boundary longest common prefix (LCP) values. Our key methodological innovation is the introduction of *boundary LCP*, the LCP values at the boundaries, in the combined extended BWT (eBWT), between blocks of characters from the same original RLBWT. This enables adaptive acceleration: for highly repetitive yet divergent string collections, such as pangenomes or multi-reference genomes, $L$ is conjectured to be much smaller than conventional measures (e.g., total text length). The algorithm builds upon the eBWT framework, integrating character-block boundary detection with boundary LCP analysis. Experiments demonstrate that our approach significantly outperforms state-of-the-art methods when $L$ is small, providing both theoretical foundations and practical tools for constructing compact, scalable indexes over large-scale repetitive sequence data.
📝 Abstract
We show how to merge run-length compressed Burrows-Wheeler Transforms (RLBWTs) quickly and in $O(R)$ space, where $R$ is the total number of runs in them, when a certain parameter is small. Specifically, we consider the boundaries in their combined extended Burrows-Wheeler Transform (eBWT) between blocks of characters from the same original RLBWT, and denote by $L$ the sum of the longest common prefix (LCP) values at those boundaries. We show how to merge the RLBWTs in $\tilde{O}(L + \sigma + R)$ time, where $\sigma$ is the alphabet size. We conjecture that $L$ tends to be small when the strings (or sets of strings) underlying the original RLBWTs are repetitive but dissimilar.
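
To make the parameter $L$ concrete, the following is a minimal brute-force sketch, not the merging algorithm described above. It builds the combined eBWT of two tiny collections by sorting all rotations explicitly, labels each position with its source collection, and sums the LCP values wherever the source changes. The function name `boundary_lcp_sum`, the truncation of rotations to a fixed width, and the reading of "boundary LCP" as the LCP between the two adjacent rotations at a boundary are assumptions made for illustration; the sketch also assumes non-empty strings and uses quadratic space, so it is only suitable for toy inputs.

```python
def rotations(s):
    """Return all cyclic rotations of s."""
    return [s[i:] + s[:i] for i in range(len(s))]


def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def boundary_lcp_sum(collection_a, collection_b):
    """Naively compute (combined eBWT, source labels, L) for two collections.

    Illustrative assumption: LCP values are measured on rotations truncated
    to `width` characters, so equal infinite powers are capped at `width`.
    """
    strings = [(s, 0) for s in collection_a] + [(s, 1) for s in collection_b]
    # The eBWT sorts rotations by comparing their infinite powers; prefixes of
    # length 2 * max_len are enough to decide that order (Fine and Wilf).
    width = 2 * max(len(s) for s, _ in strings)
    rows = []
    for s, src in strings:
        for rot in rotations(s):
            key = (rot * (width // len(rot) + 1))[:width]
            rows.append((key, rot[-1], src))
    rows.sort(key=lambda r: r[0])

    ebwt = "".join(ch for _, ch, _ in rows)
    labels = [src for _, _, src in rows]
    # A boundary is any adjacent pair of positions with different sources.
    total = sum(
        lcp(rows[i - 1][0], rows[i][0])
        for i in range(1, len(rows))
        if labels[i - 1] != labels[i]
    )
    return ebwt, labels, total


if __name__ == "__main__":
    ebwt, labels, L = boundary_lcp_sum(["aab"], ["abb"])
    print(ebwt, labels, L)
```

For the inputs `["aab"]` and `["abb"]`, the sorted rotations interleave as sources `[0, 0, 1, 0, 1, 1]`, giving three boundaries with LCP values 2, 0, and 2, so $L = 4$ in this toy case.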