🤖 AI Summary
This work addresses the memory inefficiency of conventional GPU hash tables when embedding tables exceed the capacity of a single GPU's high-bandwidth memory (HBM): dictionary-semantic tables retain every inserted key-value pair regardless of access patterns. The authors propose HierarchicalKV (HKV), the first general-purpose GPU hash table that treats cache semantics as a first-class operation. It replaces traditional dictionary semantics with policy-driven eviction: each upsert into a full bucket is resolved in place by updating an existing entry, evicting a low-scored one, or rejecting the insertion, avoiding costly rehashing and capacity-induced failures. Key mechanisms include cache-line-aligned buckets, in-line score-driven upserts, score-based dynamic dual-bucket selection, triple-group concurrency, and tiered key-value separation for scaling beyond HBM. On an NVIDIA H100 NVL, HKV sustains up to 3.9 billion find operations per second with less than 5% throughput variation across load factors 0.50 to 1.00, delivers 1.4× the find throughput of WarpCore (the strongest dictionary-semantic GPU baseline), and outperforms indirection-based baselines by 2.6–9.4×. Since its open-source release in October 2022, it has been integrated into multiple open-source recommendation frameworks.
📝 Abstract
Traditional GPU hash tables preserve every inserted key -- a dictionary assumption that wastes scarce High Bandwidth Memory (HBM) when embedding tables routinely exceed single-GPU capacity. We challenge this assumption with cache semantics, where policy-driven eviction is a first-class operation. We introduce HierarchicalKV (HKV), the first general-purpose GPU hash table library whose normal full-capacity operating contract is cache-semantic: each full-bucket upsert (update-or-insert) is resolved in place by eviction or admission rejection rather than by rehashing or capacity-induced failure. HKV co-designs four core mechanisms -- cache-line-aligned buckets, in-line score-driven upsert, score-based dynamic dual-bucket selection, and triple-group concurrency -- and uses tiered key-value separation as a scaling enabler beyond HBM. On an NVIDIA H100 NVL GPU, HKV achieves up to 3.9 billion key-value pairs per second (B-KV/s) find throughput, stable across load factors 0.50-1.00 (<5% variation), and delivers 1.4x higher find throughput than WarpCore (the strongest dictionary-semantic GPU baseline at load factor λ = 0.50) and up to 2.6-9.4x over indirection-based GPU baselines. Since its open-source release in October 2022, HKV has been integrated into multiple open-source recommendation frameworks.
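To make the cache-semantic contract concrete, here is a minimal CPU sketch of a full-bucket upsert resolved by update-in-place, score-driven eviction, or admission rejection, with a toy version of dual-bucket selection. All names (`CacheTable`, `Bucket`, `upsert`) and every policy detail (bucket size, victim choice, tie-breaking) are illustrative assumptions for exposition; this is not HKV's actual API or algorithm, and the real library runs these steps in CUDA warps against cache-line-aligned buckets.

```python
# Illustrative sketch only: cache-semantic upsert with score-driven eviction.
# Names and policy details are assumptions, not HKV's real implementation.
from dataclasses import dataclass, field

BUCKET_CAPACITY = 4  # HKV aligns buckets to cache lines; this size is arbitrary


@dataclass
class Bucket:
    slots: dict = field(default_factory=dict)  # key -> (value, score)

    def min_score_key(self):
        return min(self.slots, key=lambda k: self.slots[k][1])


class CacheTable:
    def __init__(self, num_buckets=8):
        self.buckets = [Bucket() for _ in range(num_buckets)]

    def _candidates(self, key):
        # Two hash functions give two candidate buckets ("dual-bucket" idea).
        h1 = hash(("a", key)) % len(self.buckets)
        h2 = hash(("b", key)) % len(self.buckets)
        return self.buckets[h1], self.buckets[h2]

    def upsert(self, key, value, score):
        b1, b2 = self._candidates(key)
        # 1) Update in place if the key already exists.
        for b in (b1, b2):
            if key in b.slots:
                b.slots[key] = (value, score)
                return "updated"
        # 2) Admit into a candidate bucket with a free slot.
        for b in (b1, b2):
            if len(b.slots) < BUCKET_CAPACITY:
                b.slots[key] = (value, score)
                return "inserted"
        # 3) Both full: evict the lowest-scored victim if the new entry
        #    outscores it; otherwise reject. Never rehash, never fail.
        victim_bucket = min(
            (b1, b2), key=lambda b: b.slots[b.min_score_key()][1]
        )
        vk = victim_bucket.min_score_key()
        if score > victim_bucket.slots[vk][1]:
            del victim_bucket.slots[vk]
            victim_bucket.slots[key] = (value, score)
            return "evicted"
        return "rejected"

    def find(self, key):
        for b in self._candidates(key):
            if key in b.slots:
                return b.slots[key][0]
        return None
```

Note how a full table stays serviceable: every upsert returns one of four outcomes, so throughput does not collapse at load factor 1.00 the way rehash- or overflow-based designs do.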