🤖 AI Summary
This work addresses the construction of space-efficient in-memory key-value indexes for billion-entry k-mer count tables in computational genomics, whose value distributions are highly skewed and dominated by a single frequent value. The authors propose AutoCSF, an algorithm that combines compressed static functions (CSFs) with a pre-filter, guided by a mathematically rigorous criterion for deciding when adding a filter is beneficial. The framework integrates CSFs with set-membership data structures beyond the classic Bloom filter and provides theoretical guarantees on overall space usage. The open-source implementation of AutoCSF shows space savings over baseline methods while maintaining low query latency.
📝 Abstract
We study the problem of building space-efficient, in-memory indexes for massive key-value datasets with highly skewed value distributions. This challenge arises in many data-intensive domains and is particularly acute in computational genomics, where $k$-mer count tables can contain billions of entries dominated by a single frequent value. While recent work has proposed augmenting compressed static functions (CSFs) with pre-filters to address this problem, existing approaches rely on complex heuristics and lack formal guarantees. In this paper, we introduce AutoCSF, a principled algorithm for combining CSFs with pre-filtering to provably handle skewed distributions with near-optimal space usage. We improve upon prior CSF pre-filtering constructions by (1) deriving a mathematically rigorous decision criterion for when filter augmentation is beneficial; (2) presenting a general algorithmic framework for integrating CSFs with modern set-membership data structures beyond the classic Bloom filter; and (3) establishing theoretical guarantees on the overall space usage of the resulting indexes. Our open-source implementation of AutoCSF demonstrates space savings over baseline methods while maintaining low query latency.
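To make the pre-filtering idea concrete, here is a minimal Python sketch of the general filter-in-front-of-a-static-map pattern. It is not the AutoCSF implementation and does not reproduce the paper's decision criterion. The class names `BloomFilter` and `PrefilteredTable`, the parameters `bits_per_minority_key` and `num_hashes`, and the toy table are all hypothetical, and a plain `dict` stands in for the CSF. The filter holds the keys whose value differs from the dominant one, so a key the filter rejects can be answered with the dominant value without touching the backing store.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # One salted BLAKE2b hash per probe position.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class PrefilteredTable:
    """Static key-value table with a filter over the non-dominant keys.

    Assumption: `backing` is a plain dict used as a stand-in for a CSF.
    """

    def __init__(self, table, bits_per_minority_key=10, num_hashes=7):
        self.dominant = Counter(table.values()).most_common(1)[0][0]
        minority = [k for k, v in table.items() if v != self.dominant]
        num_bits = max(8, bits_per_minority_key * len(minority))
        self.filter = BloomFilter(num_bits, num_hashes)
        for key in minority:
            self.filter.add(key)
        # The backing store must hold every key that passes the filter:
        # all minority keys plus dominant-value keys that are false positives.
        self.backing = {k: v for k, v in table.items() if k in self.filter}

    def get(self, key):
        # Like a static function, this is only defined for keys in the table.
        if key not in self.filter:
            return self.dominant
        return self.backing[key]


# Toy skewed table: most keys have count 1, every 20th key has another count.
table = {f"kmer{i}": 1 for i in range(1000)}
for i in range(0, 1000, 20):
    table[f"kmer{i}"] = 2 + (i % 3)

index = PrefilteredTable(table)
```

In this sketch the backing store only needs the minority keys and the few dominant-value keys that slip through the filter, which is where the space saving comes from when one value dominates. Whether the filter's own bits are worth that saving depends on how skewed the values are, which is the kind of trade-off a decision criterion has to settle.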