🤖 AI Summary
To address the severe accuracy degradation caused by 2-bit quantization of the KV cache in large language model (LLM) inference, this paper proposes LogQuant — the first log-distribution-based 2-bit quantization method designed specifically for KV caches. Its core innovation is a log-based filtering mechanism that selects which tokens to compress across the entire context, removing the reliance on local importance heuristics or attention-pattern prediction used by prior methods. LogQuant is implemented as a lightweight plug-in compatible with Hugging Face Transformers. Experiments show that it improves inference throughput by 25% and increases maximum batch size by 60% without additional memory overhead, while improving accuracy by 40%–200% over comparable techniques on mathematical reasoning and code-completion tasks at the same compression ratio.
📝 Abstract
We introduce LogQuant, a groundbreaking 2-bit quantization technique for the KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens from earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach: by applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks such as the Python transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.
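To make the idea concrete, here is a minimal sketch of log-based KV cache compression. This is **not** the paper's actual algorithm or the LogQuantKV API — the selection rule (`log_spaced_keep_positions`), the per-token uniform 2-bit quantizer, and all names are illustrative assumptions: tokens at roughly log-spaced distances from the newest token are kept in full precision, and everything else is stored at 2 bits.

```python
import numpy as np

def log_spaced_keep_positions(seq_len: int, num_keep: int) -> np.ndarray:
    """Hypothetical selection rule: keep tokens at log-spaced distances
    from the newest token (recent tokens dense, older tokens sparse)."""
    dists = np.unique(np.geomspace(1, seq_len, num=num_keep).astype(int))
    return seq_len - dists  # distances from the end -> absolute positions

def quantize_2bit(x: np.ndarray):
    """Uniform per-token 2-bit quantization: 4 levels spanning [min, max]."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 3.0          # 4 levels -> 3 intervals
    q = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return q, scale, lo

def dequantize_2bit(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Toy KV cache: 128 tokens, head dim 8. Keep ~16 log-spaced tokens in
# full precision and quantize the remainder to 2 bits.
rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 8)).astype(np.float32)
mask = np.zeros(128, dtype=bool)
mask[log_spaced_keep_positions(128, 16)] = True

q, scale, lo = quantize_2bit(kv[~mask])
recon = kv.copy()
recon[~mask] = dequantize_2bit(q, scale, lo)
err = np.abs(recon - kv).mean()   # mean reconstruction error
```

The log-spaced rule captures the intuition behind the paper's filtering mechanism — spending high-precision budget across the whole context rather than only on the most recent tokens — but the paper's exact selection criterion should be taken from the linked repository.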