🤖 AI Summary
Constructing the Burrows–Wheeler Transform (BWT) for long strings incurs substantial time and memory overheads, which limits scalability. To address this, the authors propose a partitioning strategy guided by a prefix of the suffix array: the long sequence is divided into shorter substrings, which are then processed with a parallel multi-string BWT construction framework. They present this as the first work to use suffix array prefixes to guide partitioning. Integrated with IBB-index optimization, the DNA-specific implementation partDNA achieves memory consumption below 1.5× the input length and up to 3.2× faster construction on real genomic datasets compared with state-of-the-art tools. The partitioning strategy itself is not limited to DNA and applies to strings over any alphabet, offering a new trade-off between construction time and memory for BWT construction on large-scale biological sequences.
📝 Abstract
Constructing the Burrows-Wheeler transform (BWT) for long strings poses significant challenges regarding construction time and memory usage. We use a prefix of the suffix array to partition a long string into shorter substrings, thereby enabling the use of multi-string BWT construction algorithms to process these partitions quickly. We provide an implementation, partDNA, for DNA sequences. Through comparison with state-of-the-art BWT construction algorithms, we show that partDNA with IBB offers a novel trade-off between construction time and memory usage for BWT construction on real genome datasets. Beyond this, the proposed partitioning strategy is applicable to strings over any alphabet.
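For readers unfamiliar with how the suffix array and the BWT relate, the following is a minimal, naive sketch in Python. It is not the partDNA algorithm and does not perform any partitioning; it only illustrates the standard relation that the i-th BWT character is the character preceding the i-th smallest suffix. The function names `naive_suffix_array` and `bwt_from_suffix_array` are hypothetical, and the sketch assumes the input ends with a unique sentinel `$` that sorts before every other character.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Illustration only: sorting full suffixes takes quadratic time and memory
    in the worst case, which is exactly the cost that dedicated construction
    algorithms avoid on long strings.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sa: list[int]) -> str:
    """Derive the BWT from a suffix array.

    BWT[i] is the character preceding the i-th smallest suffix. For the
    suffix starting at position 0, index -1 wraps around to the sentinel.
    """
    return "".join(text[i - 1] for i in sa)


# Assumption: '$' is a unique sentinel that is smaller than all other characters.
text = "banana$"
sa = naive_suffix_array(text)
bwt = bwt_from_suffix_array(text, sa)

print(sa)   # [6, 5, 3, 1, 0, 4, 2]
print(bwt)  # annb$aa
```

Because this naive version materializes and sorts every suffix, it is only practical for short inputs, which gives some intuition for why splitting a long string into shorter pieces can be attractive.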