Compressed Dictionary Matching on Run-Length Encoded Strings

📅 2025-09-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses dictionary matching in the run-length encoded (RLE) compressed domain: given a pattern set and a text that are both represented in RLE, the goal is to report all pattern occurrences without decompression. To this end, the authors introduce a new compressed representation of the Aho-Corasick automaton for RLE strings and a string index that supports fast queries on RLE strings. The algorithm runs in $O((\overline{m} + \overline{n})\log\log m + \mathrm{occ})$ expected time and $O(\overline{m})$ space, where $\overline{m}$ and $\overline{n}$ denote the numbers of RLE runs in the pattern set and the text, respectively, $m$ is the total uncompressed length of the patterns, and $\mathrm{occ}$ is the number of occurrences. This is the first non-trivial solution to the problem, with bounds that scale with the number of runs rather than with the uncompressed lengths. Since any solution must read the input, the time bound is optimal up to a $\log\log m$ factor. The approach thereby avoids the cost of decompress-then-process methods, whose running time depends on the uncompressed lengths.

📝 Abstract

Given a set of pattern strings $\mathcal{P}=\{P_1, P_2,\ldots, P_k\}$ and a text string $S$, the classic dictionary matching problem is to report all occurrences of each pattern in $S$. We study the dictionary matching problem in the compressed setting, where the pattern strings and the text string are compressed using run-length encoding, and the goal is to solve the problem without decompression and achieve efficient time and space in the size of the compressed strings. Let $m$ and $n$ be the total length of the patterns $\mathcal{P}$ and the length of the text string $S$, respectively, and let $\overline{m}$ and $\overline{n}$ be the total number of runs in the run-length encoding of the patterns in $\mathcal{P}$ and $S$, respectively. Our main result is an algorithm that achieves $O((\overline{m} + \overline{n})\log\log m + \mathrm{occ})$ expected time, and $O(\overline{m})$ space, where $\mathrm{occ}$ is the total number of occurrences of patterns in $S$. This is the first non-trivial solution to the problem. Since any solution must read the input, our time bound is optimal within a $\log\log m$ factor. We introduce several new techniques to achieve our bounds, including a new compressed representation of the classic Aho-Corasick automaton and a new efficient string index that supports fast queries in run-length encoded strings.
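To make the compressed setting concrete, here is a minimal Python sketch of run-length encoding together with a naive run-level matcher. It is not the algorithm from the paper: it checks each pattern separately against every text run, so it takes time proportional to the product of the run counts and does not reach the stated bound. It only illustrates that occurrences can be found by comparing runs rather than characters. The function names `rle_encode`, `rle_find`, and `rle_dictionary_match` are hypothetical.

```python
from itertools import groupby


def rle_encode(s):
    """Return the run-length encoding of s as a list of (char, length) runs."""
    return [(ch, sum(1 for _ in grp)) for ch, grp in groupby(s)]


def rle_find(pattern_runs, text_runs):
    """Return 0-based start positions (in uncompressed coordinates) of the
    pattern in the text, comparing runs only. Naive baseline, not the
    paper's method."""
    r = len(pattern_runs)
    if r == 0:
        return []

    # starts[i] = uncompressed position where text run i begins
    starts, pos = [], 0
    for _, length in text_runs:
        starts.append(pos)
        pos += length

    occ = []
    if r == 1:
        # A single-run pattern fits at every offset inside a long enough run.
        c, l = pattern_runs[0]
        for i, (tc, tl) in enumerate(text_runs):
            if tc == c and tl >= l:
                occ.extend(range(starts[i], starts[i] + tl - l + 1))
        return occ

    (fc, fl), (lc, ll) = pattern_runs[0], pattern_runs[-1]
    inner = pattern_runs[1:-1]
    for i in range(len(text_runs) - r + 1):
        tc, tl = text_runs[i]
        ec, el = text_runs[i + r - 1]
        # First pattern run must be a suffix of text run i, last pattern run
        # a prefix of text run i+r-1, and all inner runs must match exactly.
        if (tc == fc and tl >= fl and ec == lc and el >= ll
                and text_runs[i + 1:i + r - 1] == inner):
            occ.append(starts[i] + tl - fl)
    return occ


def rle_dictionary_match(patterns, text):
    """Map each pattern to its list of start positions in text."""
    text_runs = rle_encode(text)
    return {p: rle_find(rle_encode(p), text_runs) for p in patterns}
```

For example, `rle_encode("aaabbbcc")` gives `[('a', 3), ('b', 3), ('c', 2)]`, so this text of length $n = 8$ has $\overline{n} = 3$ runs, and `rle_dictionary_match(["abbbc", "aa"], "aaabbbcc")` returns `{'abbbc': [2], 'aa': [0, 1]}`.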

Problem

Research questions and friction points this paper is trying to address.

Efficient dictionary matching on run-length compressed strings

Solving the problem without decompression, in time and space bounded by the compressed sizes

Introducing new techniques for compressed automata and indexing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compressed Aho-Corasick automaton representation

Efficient string index for RLE queries

Near-optimal dictionary matching without decompression (time within a $\log\log m$ factor of reading the input)
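For reference, the classic Aho-Corasick automaton that the paper compresses can be sketched as follows. This is the textbook, uncompressed construction operating on plain strings, so its time and space depend on the uncompressed lengths $m$ and $n$. It is not the compressed representation introduced in the paper. The function names are hypothetical, and patterns are assumed to be non-empty.

```python
from collections import deque


def build_aho_corasick(patterns):
    """Build trie transitions, failure links, and output lists."""
    goto, fail, out = [{}], [0], [[]]
    for idx, p in enumerate(patterns):
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)

    # Breadth-first traversal: a node's failure link points to the longest
    # proper suffix of its string that is also a prefix of some pattern.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out


def aho_corasick_search(text, patterns):
    """Return (start, pattern) for every occurrence of every pattern."""
    goto, fail, out = build_aho_corasick(patterns)
    state, res = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            res.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return res
```

On the text `"aaabbbcc"` with patterns `["abbbc", "aa", "bc", "bbc"]`, the search visits all 8 characters of the text, whereas the text has only 3 runs; the gap between these two quantities is what a run-based method exploits.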

🔎 Similar Papers

Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition