🤖 AI Summary
To address low throughput in long-context LLM inference under GPU memory constraints, this work extends KV cache compression to the input processing (prefill) phase. We propose a dynamic, multi-stage KV cache compression framework that performs adaptive key-value pair compression and eviction jointly during the prefill and decode stages, co-optimizing memory footprint and computational efficiency. By shrinking the cache, the method admits larger batch sizes: experiments demonstrate up to a 2.3× increase in batch size and an average 1.9× improvement in end-to-end inference throughput for long-context workloads, with no accuracy degradation. The core innovations are: (i) the first lossy yet controllable KV cache compression technique tailored to the prefill stage, and (ii) a unified cache lifecycle management mechanism that coordinates compression and eviction across the prefill and generation phases.
📝 Abstract
Several works have developed eviction policies that remove key-value (KV) pairs from the KV cache for more efficient inference. The focus has been on compressing the KV cache after the input prompt has been processed, to speed up token generation. In settings with limited GPU memory, and when the input context is longer than the generation length, we show that also compressing the KV cache during the input processing phase allows larger batch sizes, yielding significantly higher throughput while still maintaining the original model's accuracy.
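To make the idea concrete, here is a minimal sketch of evicting KV pairs *during* prefill rather than only after it: the prompt is processed in chunks, and whenever the cache exceeds a fixed budget, the entries with the lowest accumulated attention mass are dropped. The chunking scheme, the attention-mass score, and the function `prefill_with_eviction` are illustrative assumptions (an H2O-style heavy-hitter heuristic), not the paper's actual policy, and causal masking within a chunk is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prefill_with_eviction(prompt_qkv_chunks, budget):
    """Toy chunked prefill keeping at most `budget` KV pairs.

    Eviction score: attention mass each cached position has accumulated
    from later queries (hypothetical heuristic, not the paper's method).
    """
    keys = values = scores = None
    for q, k, v in prompt_qkv_chunks:  # each chunk: (n, d) arrays
        # Append the new chunk's KV pairs to the cache.
        keys = k if keys is None else np.vstack([keys, k])
        values = v if values is None else np.vstack([values, v])
        new_scores = np.zeros(len(k))
        scores = new_scores if scores is None else np.concatenate([scores, new_scores])

        # Attention of the chunk's queries over the whole cache.
        attn = softmax(q @ keys.T / np.sqrt(q.shape[1]))  # (n, cache_len)
        scores += attn.sum(axis=0)  # accumulate attention mass per position

        # Evict lowest-scoring entries as soon as the budget is exceeded,
        # i.e. while the prompt is still being processed.
        if len(keys) > budget:
            keep = np.sort(np.argsort(scores)[-budget:])  # keep positional order
            keys, values, scores = keys[keep], values[keep], scores[keep]
    return keys, values

rng = np.random.default_rng(0)
chunks = [(rng.standard_normal((4, 8)),
           rng.standard_normal((4, 8)),
           rng.standard_normal((4, 8))) for _ in range(5)]
K, V = prefill_with_eviction(chunks, budget=10)
print(K.shape)  # cache never grows past the budget
```

Because the cache is capped at `budget` entries throughout prefill, peak memory per sequence is bounded, which is what allows the larger batch sizes the abstract describes.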