HierarchicalKV: A GPU Hash Table with Cache Semantics for Continuous Online Embedding Storage

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses severe memory inefficiency in conventional GPU hash tables when embedding tables exceed the capacity of a single GPU’s high-bandwidth memory (HBM), as these structures retain all key-value pairs regardless of access patterns. To overcome this limitation, the authors propose HierarchicalKV—the first GPU hash table that treats caching semantics as a first-class operation. It replaces traditional dictionary semantics with a policy-driven eviction mechanism that either updates entries in place, evicts a lower-scored entry, or rejects the insertion, thereby avoiding costly rehashing and overflow failures. Key innovations include cache-line-aligned buckets, inline score-driven upserts, dynamic dual-bucket selection, three-level concurrency control, and a hierarchical key-value separation architecture. Evaluated on an NVIDIA H100 NVL, HierarchicalKV achieves up to 3.9 billion key-value operations per second, maintains load factors between 0.50 and 1.00 with less than 5% throughput variation, outperforms WarpCore by 1.4×, and surpasses indirect-addressing baselines by 2.6–9.4×. It has already been integrated into multiple open-source recommendation frameworks.

📝 Abstract
Traditional GPU hash tables preserve every inserted key -- a dictionary assumption that wastes scarce High Bandwidth Memory (HBM) when embedding tables routinely exceed single-GPU capacity. We challenge this assumption with cache semantics, where policy-driven eviction is a first-class operation. We introduce HierarchicalKV (HKV), the first general-purpose GPU hash table library whose normal full-capacity operating contract is cache-semantic: each full-bucket upsert (update-or-insert) is resolved in place by eviction or admission rejection rather than by rehashing or capacity-induced failure. HKV co-designs four core mechanisms -- cache-line-aligned buckets, in-line score-driven upsert, score-based dynamic dual-bucket selection, and triple-group concurrency -- and uses tiered key-value separation as a scaling enabler beyond HBM. On an NVIDIA H100 NVL GPU, HKV achieves up to 3.9 billion key-value pairs per second (B-KV/s) find throughput, stable across load factors 0.50-1.00 (<5% variation), and delivers 1.4x higher find throughput than WarpCore (the strongest dictionary-semantic GPU baseline at lambda=0.50) and up to 2.6-9.4x over indirection-based GPU baselines. Since its open-source release in October 2022, HKV has been integrated into multiple open-source recommendation frameworks.
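The cache-semantic contract described in the abstract — every full-bucket upsert resolved in place by an in-place update, a score-driven eviction, or an admission rejection, never by rehashing or a capacity failure — can be sketched as follows. This is a minimal single-bucket Python illustration of the policy, not the HierarchicalKV API; `Bucket`, `upsert`, and the capacity of 4 are hypothetical choices for the demo.

```python
# Sketch (not the HKV API): a cache-semantic upsert on one fixed-capacity
# bucket. A full bucket never rehashes and never fails on capacity; it
# updates in place, evicts the lowest-scored entry, or rejects admission.

BUCKET_CAPACITY = 4  # real buckets are cache-line aligned; 4 keeps the demo small

class Bucket:
    def __init__(self):
        self.entries = {}  # key -> (value, score)

    def upsert(self, key, value, score):
        """Return 'updated', 'inserted', 'evicted', or 'rejected'."""
        if key in self.entries:
            self.entries[key] = (value, score)   # in-place update
            return "updated"
        if len(self.entries) < BUCKET_CAPACITY:
            self.entries[key] = (value, score)   # spare slot: plain insert
            return "inserted"
        # Full bucket: locate the current minimum-score entry.
        victim = min(self.entries, key=lambda k: self.entries[k][1])
        if score > self.entries[victim][1]:
            del self.entries[victim]             # policy-driven eviction
            self.entries[key] = (value, score)
            return "evicted"
        return "rejected"                        # admission rejection

b = Bucket()
for k in range(BUCKET_CAPACITY):
    b.upsert(k, f"emb{k}", score=k + 10)         # fill with scores 10..13
print(b.upsert(99, "hot", score=100))            # beats score 10 -> evicted
print(b.upsert(98, "cold", score=1))             # below all scores -> rejected
```

In the actual library the score would encode the eviction policy (e.g. recency or frequency) and the comparison would run warp-parallel over a cache-line-aligned bucket; the decision logic per upsert is the point of the sketch.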
Problem

Research questions and friction points this paper is trying to address.

GPU hash table
embedding storage
cache semantics
High Bandwidth Memory
online embedding
Innovation

Methods, ideas, or system contributions that make the work stand out.

cache semantics
GPU hash table
embedding storage
eviction policy
key-value separation
Haidong Rong
Senior HPC Specialist, Distributed Algorithm Specialist, Recommender System Specialist, @NVIDIA
Senior Systems Software Engineer, Open-source tech lead
Jiashu Yao
NVIDIA
Matthias Langer
NVIDIA
Shijie Liu
NVIDIA
Li Fan
Tencent
Dongxin Wang
Vipshop
Jia He
BOSS Zhipin
Jinglin Chen
University of Illinois Urbana-Champaign
Reinforcement Learning, Machine Learning
Jiaheng Rang
ByteDance
Julian Qian
Snap
Mengyao Xu
NVIDIA
Fan Yu
NVIDIA
Minseok Lee
NVIDIA
Parallel Computing, CUDA Programming, Recommender System, Deep Learning, Computer Architecture
Zehuan Wang
NVIDIA
Even Oldridge
NVIDIA