AutoCSF: Provably Space-Efficient Indexing of Skewed Key-Value Workloads via Filter-Augmented Compressed Static Functions

📅 2026-03-25
🤖 AI Summary
This work addresses the challenge of constructing space-efficient in-memory key-value indexes for billion-scale k-mer count tables in computational genomics, whose value distributions are typically highly skewed and dominated by a single high-frequency value. The authors propose AutoCSF, an algorithm that combines compressed static functions (CSFs) with a pre-filtering mechanism, guided by a mathematically rigorous criterion for deciding when adding a filter is beneficial. The framework supports set-membership data structures beyond the classic Bloom filter and comes with theoretical guarantees on overall space usage. The open-source implementation shows space savings over baseline methods while maintaining low query latency.

📝 Abstract
We study the problem of building space-efficient, in-memory indexes for massive key-value datasets with highly skewed value distributions. This challenge arises in many data-intensive domains and is particularly acute in computational genomics, where $k$-mer count tables can contain billions of entries dominated by a single frequent value. While recent work has proposed to address this problem by augmenting compressed static functions (CSFs) with pre-filters, existing approaches rely on complex heuristics and lack formal guarantees. In this paper, we introduce a principled algorithm, called AutoCSF, for combining CSFs with pre-filtering to provably handle skewed distributions with near-optimal space usage. We improve upon prior CSF pre-filtering constructions by (1) deriving a mathematically rigorous decision criterion for when filter augmentation is beneficial; (2) presenting a general algorithmic framework for integrating CSFs with modern set membership data structures beyond the classic Bloom filter; and (3) establishing theoretical guarantees on the overall space usage of the resulting indexes. Our open-source implementation of AutoCSF demonstrates space savings over baseline methods while maintaining low query latency.
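To make the filter-versus-no-filter trade-off concrete, here is a minimal back-of-the-envelope sketch. It is not the paper's decision criterion or implementation. It assumes a simplified cost model in which a CSF costs about max(1, H) bits per key (H being the empirical entropy of the values) and a Bloom filter costs log2(1/fpr) / ln 2 bits per stored key. The function names (`entropy_bits`, `csf_bits`, `bloom_bits`, `estimate_bits`) are hypothetical and exist only for illustration.

```python
import math


def entropy_bits(counts):
    """Empirical Shannon entropy (bits per key) of a list of value counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def csf_bits(counts):
    """ASSUMED cost model: a CSF spends about max(1, H) bits per key,
    since a prefix-free code cannot use less than 1 bit per key."""
    counts = list(counts)
    return sum(counts) * max(1.0, entropy_bits(counts))


def bloom_bits(num_keys, fpr):
    """Standard Bloom filter sizing: log2(1/fpr) / ln 2 bits per stored key."""
    return num_keys * math.log2(1.0 / fpr) / math.log(2)


def estimate_bits(value_counts, fpr):
    """Return (plain_csf_bits, filtered_bits) under the assumed model.

    Filtered design: a Bloom filter stores every key whose value is NOT the
    dominant one. Keys rejected by the filter are answered with the dominant
    value. Keys accepted by the filter (minority keys plus false positives
    among the dominant keys) are stored in a smaller CSF.
    """
    top_value, top_count = max(value_counts.items(), key=lambda kv: kv[1])
    minority = sum(value_counts.values()) - top_count
    plain = csf_bits(value_counts.values())

    # Expected number of dominant-value keys that leak through the filter.
    leaked = fpr * top_count
    inner_counts = [c for v, c in value_counts.items() if v != top_value]
    inner_counts.append(leaked)
    filtered = bloom_bits(minority, fpr) + csf_bits(inner_counts)
    return plain, filtered


if __name__ == "__main__":
    skewed = {1: 950_000, 2: 30_000, 3: 20_000}  # one value dominates
    uniform = {1: 250, 2: 250, 3: 250, 4: 250}   # no dominant value
    for name, table in (("skewed", skewed), ("uniform", uniform)):
        plain, filtered = estimate_bits(table, fpr=0.01)
        choice = "filter" if filtered < plain else "no filter"
        print(f"{name}: plain={plain:.0f} bits, filtered={filtered:.0f} bits -> {choice}")
```

Under these assumptions the filter pays off only when one value is sufficiently dominant; on a near-uniform table the filter's own cost outweighs what it saves in the CSF.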
Problem

Research questions and friction points this paper is trying to address.

space-efficient indexing
skewed key-value workloads
compressed static functions
k-mer count tables
in-memory indexes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compressed Static Functions
Pre-filtering
Space Efficiency
Skewed Distributions
Set Membership Data Structures
David Torres Ramos
Distill
Vihan Lakshman
MIT CSAIL
Chen Luo
Amazon Search
Todd Treangen
Rice University
Benjamin Coleman
Google DeepMind
Machine Learning
Data Structures and Algorithms