Optimizing Bloom Filters for Modern GPU Architectures

📅 2025-12-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the lack of systematic GPU optimization for Bloom filters—resulting in suboptimal throughput and high false-positive rates—this paper introduces the first co-designed optimization framework tailored to modern GPU architectures. Our method integrates warp-level vectorization, intra-cache-domain thread collaboration, and computation-memory latency hiding. Departing from CPU-centric design paradigms, it achieves 92% of theoretical peak performance on the NVIDIA B200 GPU. Compared to state-of-the-art (SOTA) implementations at equivalent false-positive rates, our approach delivers 11.35× higher lookup throughput and 15.4× faster construction throughput. Crucially, it breaks the longstanding speed–accuracy trade-off inherent in GPU-based Bloom filters, approaching the theoretical lower bound on false positives. Implemented modularly in CUDA/C++, the solution supports billions of approximate membership queries per second. The source code will be publicly released.

Technology Category

Application Category

📝 Abstract
Bloom filters are a fundamental data structure for approximate membership queries, with applications ranging from data analytics to databases and genomics. Several variants have been proposed to accommodate parallel architectures. GPUs, with massive thread-level parallelism and high-bandwidth memory, are a natural fit for accelerating these Bloom filter variants potentially to billions of operations per second. Although CPU-optimized implementations have been well studied, GPU designs remain underexplored. We close this gap by exploring the design space on GPUs along three dimensions: vectorization, thread cooperation, and compute latency. Our evaluation shows that the combination of these optimization points strongly affects throughput, with the largest gains achieved when the filter fits within the GPU's cache domain. We examine how the hardware responds to different parameter configurations and relate these observations to measured performance trends. Crucially, our optimized design overcomes the conventional trade-off between speed and precision, delivering the throughput typically restricted to high-error variants while maintaining the superior accuracy of high-precision configurations. At iso error rate, the proposed method outperforms the state-of-the-art by $11.35 imes$ ($15.4 imes$) for bulk filter lookup (construction), respectively, achieving above $92%$ of the practical speed-of-light across a wide range of configurations on a B200 GPU. We propose a modular CUDA/C++ implementation, which will be openly available soon.
Problem

Research questions and friction points this paper is trying to address.

Optimizing Bloom filters for GPU architectures to enhance performance
Exploring GPU design space through vectorization, thread cooperation, and latency
Overcoming speed-precision trade-off in GPU-accelerated Bloom filter implementations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vectorization, thread cooperation, compute latency optimization
Overcomes speed-precision trade-off for high throughput
Modular CUDA/C++ implementation for GPU Bloom filters
🔎 Similar Papers
No similar papers found.
D
Daniel Jünger
NVIDIA Corporation, Santa Clara, USA
K
Kevin Kristensen
University of Wisconsin-Madison, Madison, USA
Y
Yunsong Wang
NVIDIA Corporation, Santa Clara, USA
Xiangyao Yu
Xiangyao Yu
University of Wisconsin-Madison
DatabasesComputer Architecture
B
Bertil Schmidt
Johannes Gutenberg University, Mainz, Germany