🤖 AI Summary
Existing LZ-family compression schemes struggle to simultaneously achieve high compression ratios and efficient random access: traditional LZ variants enable fast access but sacrifice compression efficiency, while syntactic compression methods (e.g., grammar compression) support O(log n)-time access with linear space overhead yet are incompatible with general LZ factorizations. This paper introduces LZSE—the first general LZ-style compression scheme supporting O(log n)-time random access. Its core innovation is the *copy factor*, a novel primitive that references contiguous sequences of previously seen factors, enabling—under greedy parsing—a compression ratio strictly superior to that of minimal grammar compression. LZSE is constructible in linear time and employs a compact index built upon the factor structure, requiring space linear in the compressed size. Experimental evaluation demonstrates that LZSE significantly outperforms grammar-based compressors in compression ratio while retaining efficient random access.
📝 Abstract
An LZ-like factorization of a string is a factorization in which each factor is either a single character or a copy of a substring that occurs earlier in the string. While grammar-based compression schemes support efficient random access with linear space in the size of the compressed representation, such methods are not known for general LZ-like factorizations. This has led to the development of restricted LZ-like schemes such as LZ-End [Kreft and Navarro, 2013] and height-bounded (LZHB) [Bannai et al., 2024], which trade off some compression efficiency for faster access. We introduce LZ-Start-End (LZSE), a new variant of LZ-like factorizations in which each copy factor refers to a contiguous sequence of preceding factors. By its nature, any context-free grammar can easily be converted into an LZSE factorization of equal size. Further, we study the greedy LZSE factorization, in which each copy factor is taken as long as possible. We show how the greedy LZSE factorization can be computed in linear time with respect to the input string length, and that there exists a family of strings for which the size of the greedy LZSE factorization is of strictly lower order than that of the smallest grammar. These imply that our LZSE scheme is stronger than grammar-based compressions in the context of repetitiveness measures. To support fast queries, we propose a data structure for LZSE-compressed strings that permits $O(log n)$-time random access within space linear in the compressed size, where $n$ is the length of the input string.