Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the dual memory-bandwidth and memory-capacity bottlenecks that KV caching induces in LLM inference, this paper proposes an online-offline hybrid quantization mechanism. Offline, outlier thresholds are pre-determined to eliminate runtime outlier-detection overhead; online, hardware-efficient mixed-bitwidth quantization is performed using scales derived from those thresholds, backed by a custom quantization engine and a dedicated memory management unit. The method preserves accuracy (an average degradation of only 0.54%) while significantly improving throughput, achieving up to a 1.58× speedup over an NVIDIA A100 GPU at batch size 256 and outperforming existing KV cache quantization approaches. Its core innovation is the co-design of offline threshold calibration and lightweight online scaling, striking a favorable accuracy-latency trade-off. The solution can also be integrated with diverse LLM accelerators (e.g., LPUs) without modification.

📝 Abstract
Modern Large Language Model serving systems batch multiple requests to achieve high throughput, but batching attention operations is challenging, rendering memory bandwidth a critical bottleneck. The community therefore relies on high-end GPUs with multiple high-bandwidth memory channels. Unfortunately, HBM's high bandwidth often comes at the expense of limited memory capacity, which reduces core utilization and increases costs. Recent advancements enabling longer contexts for LLMs have substantially increased the key-value (KV) cache size, further intensifying the pressure on memory capacity. The literature has explored KV cache quantization techniques that commonly use low bitwidth for most values while selectively using higher bitwidth for outlier values. Although this approach helps achieve high accuracy and low bitwidth simultaneously, the cost of online outlier detection is excessively high, negating the advantages. We propose Oaken, an acceleration solution that achieves high accuracy and high performance simultaneously by co-designing the algorithm and hardware. To find a sweet spot in the accuracy-performance trade-off space of KV cache quantization, Oaken employs an online-offline hybrid approach, setting outlier thresholds offline and then using them to determine the quantization scale online. To translate the proposed algorithmic technique into tangible performance gains, Oaken also provides custom quantization engines and memory management units that can be integrated with any LLM accelerator. We built an Oaken accelerator on top of an LLM accelerator, the LPU, and conducted a comprehensive evaluation. Our experiments show that for a batch size of 256, Oaken achieves up to 1.58x throughput improvement over an NVIDIA A100 GPU while incurring a minimal accuracy loss of only 0.54% on average, compared to state-of-the-art KV cache quantization techniques.
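The mechanism the abstract describes, thresholds calibrated offline and quantization scales fixed from them online, can be sketched in NumPy. The 99.9% quantile rule, 4-bit inlier width, fp16 outlier storage, and per-tensor granularity are illustrative assumptions for this sketch, not Oaken's exact design (which relies on custom hardware and its own bitwidth scheme):

```python
import numpy as np

# Offline: calibrate an outlier threshold once, before serving.
# A quantile of calibration KV magnitudes is an assumed heuristic here;
# the point is that no outlier scan happens on the online path.
def calibrate_threshold(calib_values: np.ndarray, q: float = 0.999) -> float:
    return float(np.quantile(np.abs(calib_values), q))

# Online: quantize with a scale derived from the fixed threshold.
# Inliers go to a low bitwidth (4-bit here); outliers are zeroed in the
# dense tensor and kept at higher precision (fp16 here) on the side.
def quantize_kv(x: np.ndarray, threshold: float, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1            # symmetric signed range, e.g. [-7, 7]
    scale = threshold / qmax              # scale fixed by the offline threshold
    outlier_mask = np.abs(x) > threshold
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    q[outlier_mask] = 0                   # outliers stored separately
    outliers = x[outlier_mask].astype(np.float16)
    return q, scale, outlier_mask, outliers

def dequantize_kv(q, scale, outlier_mask, outliers):
    x = q.astype(np.float32) * scale
    x[outlier_mask] = outliers.astype(np.float32)  # restore high-precision outliers
    return x
```

Because the scale is fixed offline, the online path is a single elementwise round-and-clip plus a mask, which is what makes the scheme hardware-friendly compared with detecting outliers per request.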
Problem

Research questions and friction points this paper is trying to address.

Addresses the memory bandwidth bottleneck in LLM serving systems
Reduces the high cost of online outlier detection in KV cache quantization
Improves the throughput-accuracy trade-off in LLM acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online-offline hybrid KV cache quantization
Integration of custom quantization engines and memory management units
Offline-calibrated outlier thresholds used for online quantization scaling