Counting on General Run-Length Grammars

📅 2024-05-31
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the open problem posed by Christiansen et al. (2020): “efficiently counting pattern occurrences in compressed text.” We present the first sublinear-time solution for counting occurrences of a length-$m$ pattern in text compressed by an arbitrary run-length encoded context-free grammar (RL-CFG). Our method constructs a compressed index based on the RL-CFG, integrating a hierarchical trie with lightweight interval counting techniques. The index occupies $O(g)$ space, where $g$ is the grammar size, and supports pattern counting in $O(m \log^{2+\varepsilon} n)$ time. Experiments on real-world compressed datasets demonstrate significant speedups over decompression-and-brute-force matching. This work closes a theoretical gap in pattern counting over general RL-CFG-compressed texts and provides the first solution with both rigorous asymptotic guarantees and practical efficiency.

📝 Abstract
We introduce a data structure for counting pattern occurrences in texts compressed with any run-length context-free grammar. Our structure uses space proportional to the grammar size and counts the occurrences of a pattern of length $m$ in a text of length $n$ in time $O(m \log^{2+\epsilon} n)$, for any constant $\epsilon > 0$ chosen at indexing time. This is the first solution to an open problem posed by Christiansen et al. [ACM TALG 2020] and enhances our abilities for computation over compressed data; we give an example application.
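To make the problem concrete, the following is a minimal Python sketch of what a run-length grammar looks like and how occurrences can be counted without expanding the text. It is not the index described in the abstract: it is a simple bottom-up baseline that visits every rule once per query, so its cost grows with the grammar size, whereas the paper's query time depends only on $m$ and $n$. The rule encoding (`"term"`, `"concat"`, `"run"`), the `Summary` record, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    length: int  # length of the expanded string
    pref: str    # first min(length, m-1) symbols of the expansion
    suf: str     # last min(length, m-1) symbols of the expansion
    count: int   # occurrences of the pattern inside the expansion


def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, overlaps included."""
    total, start = 0, text.find(pattern)
    while start != -1:
        total += 1
        start = text.find(pattern, start + 1)
    return total


def combine(a, b, pattern):
    """Summary of the concatenation of two expansions."""
    k = len(pattern) - 1
    suf = a.suf + b.suf
    suf = suf[len(suf) - min(k, len(suf)):]
    # Both pieces are shorter than the pattern, so every match found in
    # a.suf + b.pref crosses the boundary between the two expansions.
    crossing = count_overlapping(a.suf + b.pref, pattern)
    return Summary(a.length + b.length, (a.pref + b.pref)[:k], suf,
                   a.count + b.count + crossing)


def power(base, t, pattern):
    """Summary of t >= 1 copies of base, using O(log t) combines."""
    result = None
    while t > 0:
        if t & 1:
            result = base if result is None else combine(result, base, pattern)
        base = combine(base, base, pattern)
        t >>= 1
    return result


def count_occurrences(grammar, start, pattern):
    """Count pattern occurrences in the expansion of `start`."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    k, memo = len(pattern) - 1, {}

    def summarize(symbol):
        if symbol not in memo:
            rule = grammar[symbol]
            if rule[0] == "term":      # A -> a
                ch = rule[1]
                memo[symbol] = Summary(1, ch[:k], ch[:k], int(ch == pattern))
            elif rule[0] == "concat":  # A -> B C
                memo[symbol] = combine(summarize(rule[1]), summarize(rule[2]), pattern)
            else:                      # A -> B^t (run-length rule)
                memo[symbol] = power(summarize(rule[1]), rule[2], pattern)
        return memo[symbol]

    return summarize(start).count


def expand(grammar, symbol):
    """Decompress a symbol; used here only to check small examples."""
    rule = grammar[symbol]
    if rule[0] == "term":
        return rule[1]
    if rule[0] == "concat":
        return expand(grammar, rule[1]) + expand(grammar, rule[2])
    return expand(grammar, rule[1]) * rule[2]


# Hypothetical example grammar: S expands to (ab)^3 a = "abababa".
GRAMMAR = {
    "A": ("term", "a"),
    "B": ("term", "b"),
    "X": ("concat", "A", "B"),
    "Y": ("run", "X", 3),
    "S": ("concat", "Y", "A"),
}

print(count_occurrences(GRAMMAR, "S", "aba"))  # 3
```

Changing the run-length rule to `("run", "X", 10**6)` leaves the grammar at five rules while the text grows to about two million symbols. The sketch still answers after a few dozen `combine` calls, which illustrates why run-length rules make counting over the grammar attractive.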
Problem

Research questions and friction points this paper is trying to address.

Compressed Text Searching
Pattern Matching
Data Compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

compressed text analysis
generic repetitive sequence rules
pattern occurrence counting
Gonzalo Navarro
University of Chile
algorithms and data structures, text searching, compression, graph databases, similarity search
Alejandro Pacheco
Center for Biotechnology and Bioengineering (CeBiB), Department of Computer Science, University of Chile, Chile