🤖 AI Summary
To address the lack of systematic GPU optimization for Bloom filters, which has left throughput well below hardware limits, this paper introduces a co-designed optimization framework tailored to modern GPU architectures. The method integrates warp-level vectorization, intra-cache-domain thread collaboration, and computation-memory latency hiding. Departing from CPU-centric design paradigms, it achieves above 92% of the practical speed-of-light on the NVIDIA B200 GPU. Compared to state-of-the-art (SOTA) implementations at equivalent false-positive rates, the approach delivers 11.35× the lookup throughput and 15.4× the construction throughput. Crucially, it breaks the longstanding speed–accuracy trade-off in GPU-based Bloom filters, approaching the theoretical lower bound on false positives. Implemented modularly in CUDA/C++, the solution supports billions of approximate membership queries per second. The source code will be publicly released.
📝 Abstract
Bloom filters are a fundamental data structure for approximate membership queries, with applications ranging from data analytics to databases and genomics. Several variants have been proposed to accommodate parallel architectures. GPUs, with massive thread-level parallelism and high-bandwidth memory, are a natural fit for accelerating these variants, potentially to billions of operations per second. While CPU-optimized implementations have been well studied, GPU designs remain underexplored. We close this gap by exploring the design space on GPUs along three dimensions: vectorization, thread cooperation, and compute latency.
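For readers unfamiliar with the underlying data structure, the following is a minimal single-threaded C++ sketch of a classical Bloom filter, using double hashing to derive k probe positions. All names here (`BloomFilter`, `mix64`) are our own illustrative choices, not the paper's API, and the GPU design described above (warp-level vectorization, cache-domain cooperation) is considerably more elaborate:

```cpp
#include <cstdint>
#include <vector>

// Illustrative Bloom filter: m bits, k hash functions, no false negatives.
struct BloomFilter {
    std::vector<uint64_t> bits_;  // bit array, m bits rounded up to 64
    uint64_t m_;                  // number of bits
    int k_;                       // number of hash probes per key

    BloomFilter(uint64_t m, int k) : bits_((m + 63) / 64, 0), m_(m), k_(k) {}

    // splitmix64-style finalizer as a stand-in 64-bit hash.
    static uint64_t mix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    // Double hashing: the i-th probe position is h1 + i*h2 (mod m).
    void insert(uint64_t key) {
        uint64_t h1 = mix64(key), h2 = mix64(h1) | 1;  // h2 odd
        for (int i = 0; i < k_; ++i) {
            uint64_t pos = (h1 + (uint64_t)i * h2) % m_;
            bits_[pos / 64] |= (1ULL << (pos % 64));
        }
    }

    // May return true for a key never inserted (false positive),
    // but never false for an inserted key.
    bool contains(uint64_t key) const {
        uint64_t h1 = mix64(key), h2 = mix64(h1) | 1;
        for (int i = 0; i < k_; ++i) {
            uint64_t pos = (h1 + (uint64_t)i * h2) % m_;
            if (!(bits_[pos / 64] & (1ULL << (pos % 64)))) return false;
        }
        return true;
    }
};
```

Each lookup touches up to k scattered bit positions, which is why memory latency and cache residency, rather than arithmetic, dominate performance on GPUs.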
Our evaluation shows that the combination of these optimization points strongly affects throughput, with the largest gains achieved when the filter fits within the GPU's cache domain. We examine how the hardware responds to different parameter configurations and relate these observations to measured performance trends. Crucially, our optimized design overcomes the conventional trade-off between speed and precision, delivering the throughput typically restricted to high-error variants while maintaining the superior accuracy of high-precision configurations. At iso error rate, the proposed method outperforms the state-of-the-art by $11.35\times$ ($15.4\times$) for bulk filter lookup (construction), respectively, achieving above $92\%$ of the practical speed-of-light across a wide range of configurations on a B200 GPU. We propose a modular CUDA/C++ implementation, which will be openly available soon.
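The speed–accuracy trade-off and the "theoretical lower bound" on false positives can be made concrete with the standard Bloom filter analysis (a textbook result, not a formula from this paper): with m bits, n keys, and k hash functions, the false-positive rate is approximately $(1 - e^{-kn/m})^k$, minimized at $k = (m/n)\ln 2$. A small C++ sketch of these formulas, with our own illustrative function names:

```cpp
#include <cmath>

// Approximate false-positive rate of a Bloom filter with m bits,
// n inserted keys, and k hash functions: (1 - e^{-kn/m})^k.
double bloom_fpr(double m, double n, int k) {
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// The k that minimizes the false-positive rate: (m/n) * ln 2, rounded.
int optimal_k(double m, double n) {
    return (int)std::round(m / n * std::log(2.0));
}
```

At 10 bits per key, for example, the optimum is k = 7 hashes, giving a false-positive rate of roughly 0.8%. High-precision configurations use more hashes and bits per key, which traditionally costs throughput; the claim above is that the proposed design reaches such error rates without that penalty.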