🤖 AI Summary
This work addresses the high bandwidth and storage overhead in large language model inference caused by frequent accesses to the key-value (KV) cache, as well as the low resource utilization of existing HBM-PIM architectures. The authors propose a vertically heterogeneous HBM-PIM architecture that leverages HBM4 logic dies to construct a stack comprising high-density memory layers and processing-in-memory (PIM) compute layers, coordinated by a logic base die for cross-layer data migration and attention computation. The design incorporates topology-aware KV placement, bounded replication, inline quantization, and workload-aware eviction strategies to manage the KV cache efficiently without host-side overhead. Evaluated across four models, the proposed architecture achieves a 1.62× geometric mean throughput improvement and 1.70× higher SLO-compliant serving capacity compared to AttAcc, while reducing per-token energy consumption by 30%–47%.
📝 Abstract
Large language model (LLM) serving is now limited by the key-value (KV) cache. During decode, each new token rereads prior KV state, so attention becomes a bandwidth- and capacity-heavy memory task. HBM-PIM helps by moving attention closer to memory, but current stack organizations still waste resources. In practice, only hot KV blocks benefit from near-memory compute. Weights, activations, and cold KV mainly need dense storage and GPU-visible bandwidth. A uniform HBM-PIM stack makes all layers pay for PIM logic, while a dedicated-PIM design such as AttAcc recovers capacity but shrinks the HBM bandwidth left for GPU-side work. We propose TokenStack, a vertically heterogeneous HBM-PIM architecture for KV-centric LLM serving that leverages HBM4's logic-die substrate. TokenStack separates each stack into dense capacity layers and PIM-enabled compute layers, then uses the logic base die as a stack-local control point that manages cross-layer movement without host-side overhead. The base-die controller handles cross-layer DMA, layered address translation, attention-side gather/broadcast coordination, and inline quantization during migration. On top of this hardware, TokenStack uses topology-aware KV placement, workload-aware eviction, and bounded replication to keep hot KV near PIM compute while moving colder state to dense layers. Using production-derived traces across four models, completed multi-QPS runs show that TokenStack increases geometric-mean token throughput by 1.62× and SLO-compliant serving capacity by 1.70× over AttAcc, and reduces per-token energy by 30%–47%.
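The hot/cold split described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not TokenStack's controller: the class `TwoTierKVStore` and its methods are hypothetical names, plain LRU stands in for the workload-aware eviction policy, and symmetric int8 is an assumed quantization format (the abstract does not give a bit width). Hot blocks stay in a small "PIM" tier at full precision; evicted blocks are quantized as they move to the "dense" tier.

```python
from collections import OrderedDict


class TwoTierKVStore:
    """Toy model: a small 'PIM' tier for hot KV blocks, a 'dense' tier for cold ones."""

    def __init__(self, pim_capacity_blocks):
        self.pim_capacity = pim_capacity_blocks
        self.pim = OrderedDict()  # block_id -> list[float], least recently used first
        self.dense = {}           # block_id -> (scale, list[int]), int8-quantized

    @staticmethod
    def _quantize(values):
        # Assumed format: symmetric int8, with the largest |v| mapped to 127.
        peak = max((abs(v) for v in values), default=0.0)
        scale = peak / 127.0 if peak > 0 else 1.0
        return scale, [round(v / scale) for v in values]

    @staticmethod
    def _dequantize(scale, quantized):
        return [scale * q for q in quantized]

    def put(self, block_id, values):
        # New or rewritten blocks are treated as hot and placed in the PIM tier.
        self.dense.pop(block_id, None)
        self.pim[block_id] = list(values)
        self.pim.move_to_end(block_id)
        self._evict_if_needed()

    def access(self, block_id):
        # A hit in the PIM tier only refreshes recency.
        if block_id in self.pim:
            self.pim.move_to_end(block_id)
            return self.pim[block_id]
        # A cold block is dequantized and promoted (KeyError if unknown).
        scale, quantized = self.dense.pop(block_id)
        values = self._dequantize(scale, quantized)
        self.pim[block_id] = values
        self._evict_if_needed()
        return values

    def _evict_if_needed(self):
        # LRU stand-in for workload-aware eviction: demote the least recently
        # used block, quantizing it during the move to the dense tier.
        while len(self.pim) > self.pim_capacity:
            victim, values = self.pim.popitem(last=False)
            self.dense[victim] = self._quantize(values)
```

With a capacity of two blocks, inserting a third block demotes the oldest one to the dense tier in quantized form, and a later `access` on that block promotes it back while demoting another. The real design performs these moves in the base-die controller rather than in host software; the model above only shows the bookkeeping.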