Compressing Suffix Trees by Path Decompositions

📅 2025-06-17

🤖 AI Summary

This paper addresses the high space overhead and low query efficiency of suffix trees. It proposes a compression scheme based on path decomposition: the suffix trie is decomposed into disjoint node-to-leaf paths, and each path is represented solely by an index pointing to a text suffix. The key insight is that the resulting array of path indices, when sorted by the colexicographic order of the corresponding text prefixes, supports cache-efficient pattern matching and has a number of entries proportional to the size of the compressed text. The method combines colexicographic sorting of prefixes, suffix trie path decomposition, random access to the text, and binary search. Experimentally, the resulting index is smaller than the state-of-the-art r-index and locates all occurrences of a query pattern one to two orders of magnitude faster.

📝 Abstract

In classic suffix trees, path compression works by replacing unary suffix trie paths with pairs of pointers to the text $T$, which must be available in the form of some random access oracle at query time. In this paper, we revisit path compression and show that a more careful choice of pointers leads to a new elegant, simple, and remarkably efficient way to compress the suffix tree. We begin by observing that an alternative way to path-compress the suffix trie of $T$ is to decompose it into a set of (disjoint) node-to-leaf paths and then represent each path as a pointer $i$ to one of the string's suffixes $T[i,n]$. We then show that the array $A$ of such indices $i$, sorted by the colexicographic order of the corresponding text prefixes $T[1,i]$, possesses the following properties: (i) it supports \emph{cache-efficient} pattern matching queries via simple binary search on $A$ and random access on $T$, and (ii) its number of entries is proportional to the size of the \emph{compressed text}. Of particular interest is the path decomposition given by the colexicographic rank of $T$'s prefixes. The resulting index is smaller and orders of magnitude faster than the $r$-index on the task of locating all occurrences of a query pattern.
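To make the colexicographic ordering and the binary search concrete, here is a minimal Python sketch. It is a toy illustration, not the paper's index: it stores every position $i$ of the text rather than the compressed subset obtained from a path decomposition, and the function names `colex_prefix_array` and `occurrences_ending_at` are hypothetical. It only shows that sorting positions by the reversed prefixes $T[1,i]$ lets a plain binary search, with random access to the text, find all prefixes that end with a given pattern.

```python
def colex_prefix_array(text: str) -> list[int]:
    """Return all 1-based positions i, sorted by the colexicographic
    order of the prefixes text[:i] (prefixes compared right to left).
    Assumption: the full set of positions is kept, unlike the paper's
    compressed array, which keeps one index per path."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[:i][::-1])


def _bound(text: str, A: list[int], rev_pattern: str, upper: bool) -> int:
    """Binary search over A, comparing only the last len(pattern)
    characters of each prefix (read right to left) with the reversed
    pattern. Each comparison is a random access into the text."""
    m = len(rev_pattern)
    lo, hi = 0, len(A)
    while lo < hi:
        mid = (lo + hi) // 2
        i = A[mid]
        key = text[max(0, i - m):i][::-1]
        if key < rev_pattern or (upper and key == rev_pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def occurrences_ending_at(text: str, A: list[int], pattern: str) -> list[int]:
    """Return the sorted 1-based end positions of all occurrences of
    `pattern` in `text`, using two binary searches on A."""
    rev_pattern = pattern[::-1]
    first = _bound(text, A, rev_pattern, upper=False)
    last = _bound(text, A, rev_pattern, upper=True)
    return sorted(A[first:last])


if __name__ == "__main__":
    text = "banana"
    A = colex_prefix_array(text)
    print(A)                                    # [2, 4, 6, 1, 3, 5]
    print(occurrences_ending_at(text, A, "ana"))  # [4, 6]
```

In this toy version the array has one entry per text position, so it is not compressed at all; the point of the paper is that a suitable path decomposition keeps only a subset of these indices while still allowing binary-search-based matching.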

Problem

Research questions and friction points this paper is trying to address.

Compress suffix trees efficiently using path decompositions

Improve pattern matching via cache-efficient binary search

Reduce index size compared to existing methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Path decomposition for suffix tree compression

Colexicographic rank for efficient indexing

Binary search on array for pattern matching
