🤖 AI Summary
This paper addresses dictionary matching in the run-length encoded (RLE) compressed domain: given a pattern set and a text that are both represented in RLE, the goal is to report all pattern occurrences without decompression. To this end, we introduce a new compressed representation of the Aho-Corasick automaton tailored to RLE strings and design a string index that supports fast queries on RLE strings. Our algorithm runs in $O((\tau_m + \tau_n) \log\log m + \mathit{occ})$ expected time and uses $O(\tau_m)$ space, where $\tau_m$ and $\tau_n$ denote the numbers of RLE runs in the pattern set and the text, respectively, $m$ is the total uncompressed length of the patterns, and $\mathit{occ}$ is the number of occurrences. This is the first non-trivial dictionary matching solution whose running time is governed by the run counts rather than the uncompressed lengths, and because any solution must read the input, the time bound is optimal within a $\log\log m$ factor. It thus avoids the cost of the conventional decompress-then-process approach.
📝 Abstract
Given a set of pattern strings $\mathcal{P}=\{P_1, P_2,\ldots, P_k\}$ and a text string $S$, the classic dictionary matching problem is to report all occurrences of each pattern in $S$. We study the dictionary matching problem in the compressed setting, where the pattern strings and the text string are compressed using run-length encoding, and the goal is to solve the problem without decompression while achieving time and space bounds that are efficient in the size of the compressed strings. Let $m$ and $n$ be the total length of the patterns in $\mathcal{P}$ and the length of the text string $S$, respectively, and let $\overline{m}$ and $\overline{n}$ be the total number of runs in the run-length encoding of the patterns in $\mathcal{P}$ and of $S$, respectively. Our main result is an algorithm that achieves $O((\overline{m} + \overline{n})\log\log m + \mathrm{occ})$ expected time and $O(\overline{m})$ space, where $\mathrm{occ}$ is the total number of occurrences of patterns in $S$. This is the first non-trivial solution to the problem. Since any solution must read the input, our time bound is optimal within a $\log\log m$ factor. We introduce several new techniques to achieve our bounds, including a new compressed representation of the classic Aho-Corasick automaton and a new efficient string index that supports fast queries in run-length encoded strings.
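To make the parameters concrete, the following minimal Python sketch run-length encodes a small, made-up dictionary and text, then computes $m$, $n$, $\overline{m}$, $\overline{n}$, and $\mathrm{occ}$. It is not the algorithm of the paper: the helper names `rle` and `naive_occurrences` and the example strings are assumptions chosen for illustration, and the matcher is a plain baseline that works on the uncompressed strings only to define what must be reported.

```python
from itertools import groupby


def rle(s):
    """Run-length encode s as a list of (character, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def naive_occurrences(patterns, text):
    """Uncompressed baseline: every (pattern_index, start_position) match."""
    occ = []
    for i, p in enumerate(patterns):
        start = text.find(p)
        while start != -1:
            occ.append((i, start))
            # Continue one position later so overlapping matches are found.
            start = text.find(p, start + 1)
    return occ


# Hypothetical example input, not taken from the paper.
patterns = ["aaab", "abb", "bbbba"]
text = "aaaabbbbaaab"

m = sum(len(p) for p in patterns)            # total uncompressed pattern length
n = len(text)                                # uncompressed text length
m_bar = sum(len(rle(p)) for p in patterns)   # total number of runs in the patterns
n_bar = len(rle(text))                       # number of runs in the text
occ = naive_occurrences(patterns, text)

print(rle(text))          # runs of the text as (character, length) pairs
print(m, n, m_bar, n_bar)
print(occ)
```

Here the uncompressed sizes are $m = n = 12$, while the run counts are only $\overline{m} = 6$ and $\overline{n} = 4$. The gap grows with the run lengths, which is why bounds stated in $\overline{m}$ and $\overline{n}$ can be much smaller than bounds stated in $m$ and $n$.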