LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe accuracy degradation and memory–throughput imbalance caused by 2-bit quantization of KV caches in large language model (LLM) inference, this paper proposes LogQuant, the first log-distribution-based 2-bit quantization method designed specifically for KV caches. Its core innovation is a global, log-distributed filtering mechanism that selects which positions to compress across the entire context, removing the reliance on local importance heuristics or attention-pattern prediction used by prior methods. LogQuant combines logarithmic position filtering with 2-bit KV cache quantization and is implemented as a lightweight plug-in compatible with Hugging Face Transformers. Experiments show that LogQuant achieves a 25% improvement in inference throughput and a 60% increase in maximum batch size without additional memory overhead, while improving accuracy by 40%–200% on mathematical reasoning and code completion tasks at the same compression ratio.

📝 Abstract
We introduce LogQuant, a groundbreaking 2-bit quantization technique for the KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach: by applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even a reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks such as the Hugging Face Transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.
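The abstract describes two ingredients: a log-based filter that decides which token positions keep full precision, and 2-bit quantization of the remaining KV cache entries. The sketch below illustrates that combination in NumPy. It is a minimal illustration under assumptions, not the paper's actual algorithm: the function names (`log_spaced_keep_mask`, `quantize_2bit`) are hypothetical, and the paper's precise filtering and quantization details are not given in this summary.

```python
import numpy as np

def log_spaced_keep_mask(seq_len, n_keep):
    # Hypothetical sketch of log-distributed position selection: keep tokens
    # at log-spaced distances back from the newest position, so recent
    # context stays dense in full precision and older context is sparse.
    if n_keep >= seq_len:
        return np.ones(seq_len, dtype=bool)
    # Log-spaced offsets measured backwards from the most recent token.
    offsets = np.unique(
        np.round(np.logspace(0, np.log10(seq_len), n_keep)).astype(int) - 1
    )
    mask = np.zeros(seq_len, dtype=bool)
    mask[seq_len - 1 - offsets[offsets < seq_len]] = True
    return mask

def quantize_2bit(x):
    # Per-row min-max uniform 2-bit quantization (4 levels, codes 0..3),
    # a generic scheme used here for illustration only.
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 3.0
    scale = np.where(scale == 0, 1.0, scale)  # guard constant rows
    q = np.clip(np.round((x - lo) / scale), 0, 3)
    return q.astype(np.uint8), lo, scale

def dequantize_2bit(q, lo, scale):
    # Reconstruct approximate values from 2-bit codes plus per-row offsets.
    return q.astype(np.float32) * scale + lo
```

In use, positions where the mask is `True` would be served from the full-precision cache, while the rest are stored as 2-bit codes plus per-row `lo`/`scale` metadata; the round-trip error of the quantized entries is bounded by half a quantization step per row.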
Problem

Research questions and friction points this paper is trying to address.

Develops 2-bit KV Cache quantization for LLM memory efficiency
Improves accuracy in Math and Code Completion tasks
Enhances throughput and batch size without extra memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Log-based filtering for KV Cache compression
2-bit quantization with superior accuracy
25% throughput and 60% batch size boost
Authors
Han Chen — School of Computing, National University of Singapore
Zicong Jiang — PhD student at Chalmers University of Technology (Communication Systems; Optical fiber communication and sensing; Machine learning; Generative AI)
Zining Zhang — School of Computing, National University of Singapore
Bingsheng He — School of Computing, National University of Singapore
Pingyi Luo — 4Paradigm
Mian Lu — 4Paradigm Technology (machine learning systems; GPGPU; high performance computing)
Yuqiang Chen — 4Paradigm