🤖 AI Summary
Efficient compression and fast pattern matching on highly repetitive texts remain challenging due to space–time trade-offs in existing full-text indexes.
Method: This paper proposes a compressed full-text index built on two complementary repetitiveness measures: *r**, the sum of the numbers of runs in the Burrows–Wheeler Transforms (BWTs) of the text and of its reverse, and *z*, the number of phrases in the LZ77 parse of the text. The index draws on both the run-length structure of the two BWTs and the LZ77 parse.
Contribution/Results: For a text of length *n* over an alphabet of size polylog(*n*), the index occupies *O*(*r** log(*n*/*r**) + *z* log *n*) bits. Given a pattern of length *m*, it reports all occ occurrences in *O*(*m* log *n* + occ log^ε *n*) time, and reports the leftmost and rightmost occurrences in *O*(*m* log^ε *n*) time within the same space.
📝 Abstract
Let $T[1..n]$ be a text over an alphabet of size $\sigma \in \mathrm{polylog}(n)$, let $r^*$ be the sum of the numbers of runs in the Burrows-Wheeler Transforms of $T$ and its reverse, and let $z$ be the number of phrases in the LZ77 parse of $T$. We show how to store $T$ in $O(r^* \log(n / r^*) + z \log n)$ bits such that, given a pattern $P[1..m]$, we can report the locations of the $\mathrm{occ}$ occurrences of $P$ in $T$ in $O(m \log n + \mathrm{occ} \log^{\epsilon} n)$ time. We can also report the position of the leftmost and rightmost occurrences of $P$ in $T$ in the same space and $O(m \log^{\epsilon} n)$ time.
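To make the two measures concrete, the following is a minimal Python sketch that computes $r^*$ and $z$ for a toy text by brute force. It is not the paper's index or construction. The function names, the `$` sentinel appended before taking the BWT, and the particular LZ77 variant (longest previous factor with overlap allowed, or a single new character) are illustrative assumptions; other conventions can change the counts by small amounts.

```python
from itertools import groupby


def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via sorted rotations (quadratic; toy inputs only).

    Assumption: the sentinel is smaller than every character of the text.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for _ in groupby(s))


def r_star(text):
    """Runs in BWT(text) plus runs in BWT(reversed text)."""
    return count_runs(bwt(text)) + count_runs(bwt(text[::-1]))


def lz77_phrase_count(text):
    """Number of phrases in a greedy LZ77 parse.

    Assumed variant: each phrase is the longest prefix of the remaining text
    that also starts at an earlier position (overlap allowed), or a single
    new character when no such prefix exists.
    """
    i, phrases = 0, 0
    while i < len(text):
        length = 0
        # Extend while text[i:i+length+1] has an occurrence starting before i.
        # Any such occurrence must end at or before index i + length.
        while (i + length < len(text)
               and text.find(text[i:i + length + 1], 0, i + length) != -1):
            length += 1
        i += max(length, 1)
        phrases += 1
    return phrases


if __name__ == "__main__":
    t = "abababab"
    print(bwt(t), r_star(t), lz77_phrase_count(t))
```

On the highly repetitive text `abababab`, this sketch gives $r^* = 7$ (3 runs in the forward BWT, 4 in the BWT of the reverse) and $z = 3$, both small relative to what a non-repetitive text of similar length would typically produce.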