🤖 AI Summary
To address the severe accuracy degradation caused by 2-bit quantization of the KV cache in large language model (LLM) inference, this paper proposes LogQuant — the first log-distribution-based 2-bit quantization method designed specifically for KV caches. Its core innovation is a log-based filtering mechanism that selects which tokens to compress across the entire context, removing the reliance on local importance heuristics or attention-pattern prediction used by prior methods. LogQuant is implemented as a lightweight plug-in compatible with Hugging Face Transformers. Experiments show that it improves inference throughput by 25% and increases maximum batch size by 60% without additional memory overhead, while improving accuracy by 40%–200% over comparable techniques on mathematical reasoning and code-completion tasks at the same compression ratio.
📝 Abstract
We introduce LogQuant, a groundbreaking 2-bit quantization technique for the KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens from earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach: by applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks such as the Python transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.
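To make the idea concrete, here is a minimal sketch of log-based KV cache compression. This is **not** the paper's actual algorithm or the LogQuantKV API — the selection rule (`log_spaced_keep_positions`), the per-token uniform 2-bit quantizer, and all names are illustrative assumptions: tokens at roughly log-spaced distances from the newest token are kept in full precision, and everything else is stored at 2 bits.

```python
import numpy as np

def log_spaced_keep_positions(seq_len: int, num_keep: int) -> np.ndarray:
    """Hypothetical selection rule: keep tokens at log-spaced distances
    from the newest token (recent tokens dense, older tokens sparse)."""
    dists = np.unique(np.geomspace(1, seq_len, num=num_keep).astype(int))
    return seq_len - dists  # distances from the end -> absolute positions

def quantize_2bit(x: np.ndarray):
    """Uniform per-token 2-bit quantization: 4 levels spanning [min, max]."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 3.0          # 4 levels -> 3 intervals
    q = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return q, scale, lo

def dequantize_2bit(q, scale, lo):
    return q.astype(np.float32) * scale + lo

# Toy KV cache: 128 tokens, head dim 8. Keep ~16 log-spaced tokens in
# full precision and quantize the remainder to 2 bits.
rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 8)).astype(np.float32)
mask = np.zeros(128, dtype=bool)
mask[log_spaced_keep_positions(128, 16)] = True

q, scale, lo = quantize_2bit(kv[~mask])
recon = kv.copy()
recon[~mask] = dequantize_2bit(q, scale, lo)
err = np.abs(recon - kv).mean()   # mean reconstruction error
```

The log-spaced rule captures the intuition behind the paper's filtering mechanism — spending high-precision budget across the whole context rather than only on the most recent tokens — but the paper's exact selection criterion should be taken from the linked repository.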