OnPair: Short Strings Compression for Fast Random Access

📅 2025-08-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Memory-resident databases require compression schemes that simultaneously achieve high compression ratios and low-latency random access, yet existing methods (e.g., BPE, FSST) struggle to balance both. This paper introduces OnPair, a cache-friendly dictionary-based compression algorithm tailored for in-memory databases. Its core innovation is a single-pass, incremental substring merging mechanism that constructs a dictionary of frequent adjacent substrings without global position tracking; the OnPair16 variant further constrains dictionary entries to 16 bytes and employs optimized longest-prefix matching for efficient parsing. Experiments on real-world datasets show that OnPair matches BPE's compression ratio while accelerating compression by up to 3.2× and reducing memory overhead by 40–65%. It significantly outperforms state-of-the-art alternatives, achieving superior trade-offs among compression efficiency, random-access performance, and resource consumption.
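The single-pass dictionary construction described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's exact procedure: the `merge_threshold` parameter, the `greedy_parse` helper, and the rule of merging a pair the moment its count crosses the threshold are all assumptions made here to show the general idea of merging frequent adjacent substrings without tracking global pair positions.

```python
def greedy_parse(s, dictionary, max_len=16):
    # Longest-prefix matching against the current dictionary.
    # Single bytes are always present, so parsing always makes progress.
    tokens, i = [], 0
    while i < len(s):
        for l in range(min(max_len, len(s) - i), 0, -1):
            cand = s[i:i + l]
            if cand in dictionary:
                tokens.append(cand)
                i += l
                break
    return tokens

def build_dict(sample, merge_threshold=4, max_len=16):
    # One sequential pass over the sample: parse each string with the
    # dictionary as it currently stands, count adjacent token pairs, and
    # merge a pair immediately once it looks frequent enough. No global
    # pair positions are stored, and no string is revisited.
    dictionary = {bytes([b]) for b in range(256)}
    pair_counts = {}
    for s in sample:
        tokens = greedy_parse(s, dictionary, max_len)
        for a, b in zip(tokens, tokens[1:]):
            merged = a + b
            if len(merged) > max_len or merged in dictionary:
                continue
            pair_counts[merged] = pair_counts.get(merged, 0) + 1
            if pair_counts[merged] >= merge_threshold:
                dictionary.add(merged)  # merge in place; later strings benefit
    return dictionary
```

The `max_len=16` cap mirrors the OnPair16 variant's 16-byte limit on dictionary entries; later strings in the sample are parsed with the already-merged entries, which is what makes the pass incremental.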

πŸ“ Abstract
We present OnPair, a dictionary-based compression algorithm designed to meet the needs of in-memory database systems that require both high compression and fast random access. Existing methods either achieve strong compression ratios at significant computational and memory cost (e.g., BPE) or prioritize speed at the expense of compression quality (e.g., FSST). OnPair bridges this gap by employing a cache-friendly dictionary construction technique that incrementally merges frequent adjacent substrings in a single sequential pass over a data sample. This enables fast, memory-efficient training without tracking global pair positions, as required by traditional BPE. We also introduce OnPair16, a variant that limits dictionary entries to 16 bytes, enabling faster parsing via optimized longest prefix matching. Both variants compress strings independently, supporting fine-grained random access without block-level overhead. Experiments on real-world datasets show that OnPair and OnPair16 achieve compression ratios comparable to BPE while significantly improving compression speed and memory usage.
Problem

Research questions and friction points this paper is trying to address.

Bridges the gap between high compression ratios and fast random access
Reduces the computational and memory cost of string compression
Enables fine-grained random access without block-level overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cache-friendly dictionary construction technique
Incremental merging of frequent adjacent substrings
Optimized longest prefix matching for fast parsing
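The independent per-string encoding that enables fine-grained random access can be illustrated as follows. This is a hedged sketch under stated assumptions: the helper names (`compress`, `decompress`, `codebook`, `entries`) are hypothetical, and fixed two-byte token ids are a simplification; the paper's actual code layout is not specified here.

```python
def compress(s, codebook, max_len=16):
    # Encode one string entirely on its own as a sequence of 2-byte
    # token ids, taking the longest dictionary entry at each position.
    out, i = bytearray(), 0
    while i < len(s):
        for l in range(min(max_len, len(s) - i), 0, -1):
            code = codebook.get(s[i:i + l])
            if code is not None:
                out += code.to_bytes(2, "little")
                i += l
                break
    return bytes(out)

def decompress(enc, entries):
    # Decoding reads only this string's codes plus the shared dictionary,
    # so any single string can be accessed without touching its neighbors
    # and without any block-level framing.
    return b"".join(entries[int.from_bytes(enc[j:j + 2], "little")]
                    for j in range(0, len(enc), 2))
```

Because the codebook includes every single byte, `compress` always makes progress, and decompression of one string never requires decoding adjacent strings.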