π€ AI Summary
Traditional GPU concurrent data structures struggle to simultaneously achieve efficient queries and flexible updates under dynamic workloads due to high indexing overhead, warp divergence, and contention. This work proposes FliX, which introduces a novel βcompute-to-bucketβ mapping paradigm that eliminates explicit index layers by binding warps directly to data buckets. FliX replaces index traversal with binary search over batched operations, enabling lock-free dynamic memory management and order maintenance. The design yields substantial performance gains: query latency is reduced by 6.5Γ over state-of-the-art GPU B-trees and 1.5Γ over LSM trees; per-byte throughput reaches four times that of existing ordered structures; and under insert-intensive workloads, FliX outperforms the best-known dynamic ordered baselines by more than 8Γ.
π Abstract
GPU-based concurrent data structures (CDSs) achieve high throughput for read-only queries, but efficient support for dynamic updates on fully GPU-resident data remains challenging. Ordered CDSs (e.g., B-trees and LSM-trees) maintain an index layer that directs operations to a data layer (buckets or leaves), while hash tables avoid the cost of maintaining order but do not support range or successor queries. On GPUs, maintaining and traversing an index layer under frequent updates introduces contention and warp divergence.
To tackle these problems, we flip the indexing paradigm on its head with FliX, a comparison-based, flipped indexing strategy for dynamic, fully GPU-resident CDSs. Traditional GPU CDSs typically take a batch of operations and assign each operation to a GPU thread or warp. FliX, however, assigns compute (e.g., a warp) to each bucket in the data layer, and each bucket then locates operations it is responsible for in the batch. FliX can replace many index layer traversals with a single binary search on the batch, reducing redundant work and warp divergence. Further, FliX simplifies updates as no index layer must be maintained.
In our experiments, FliX achieves 6.5x reduced query latency compared to a leading GPU B-tree and 1.5x compared to a leading GPU LSM-tree, while delivering 4x higher throughput per memory footprint than ordered competitors. Despite maintaining order, FliX also surpasses state-of-the-art unordered GPU hash tables in query and deletion performance, and is highly competitive in insertion performance. In update-heavy workloads, it outperforms the closest fully dynamic ordered baseline by over 8x in insertion throughput while supporting dynamic memory reclamation. These results suggest that eliminating the index layer and adopting a compute-to-bucket mapping can enable practical, fully dynamic GPU indexing without sacrificing query performance.