KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
KV cache memory consumption in long-context LLM inference scales linearly with sequence length, posing a critical bottleneck for edge deployment. Method: This paper proposes KV Pareto, the first framework to systematically model and jointly optimize the multi-dimensional trade-offs among KV cache quantization, chunked prefill, and weight quantization, automatically searching for model-specific Pareto-optimal configurations. It integrates int2/4/8 and mixed-precision KV quantization, multi-granularity (per-token/per-tensor/per-block) control, and AWQ 4-bit weight quantization across mainstream architectures (Qwen, Llama, Mistral). Contribution/Results: Experiments demonstrate 68–78% total memory reduction with only 1–3% accuracy degradation. Robustness is validated on Needle-in-a-Haystack, GSM8k, MMLU, and context lengths up to 128k.
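The int2/4/8 KV quantization with per-token granularity mentioned above follows the standard asymmetric scheme: one scale and zero-point per token row. A minimal NumPy sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def quantize_kv_per_token(kv: np.ndarray, bits: int = 8):
    """Asymmetrically quantize a KV tensor of shape (tokens, dim),
    computing one scale/zero-point pair per token (row)."""
    qmax = 2**bits - 1
    lo = kv.min(axis=-1, keepdims=True)
    hi = kv.max(axis=-1, keepdims=True)
    scale = (hi - lo) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard constant rows
    q = np.clip(np.round((kv - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize_kv(q, scale, lo):
    """Reconstruct an approximate float tensor from quantized values."""
    return q.astype(np.float32) * scale + lo
```

The same routine covers the int2 and int4 settings by changing `bits`; per-tensor or per-block granularity only changes the axes over which `min`/`max` are taken.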

📝 Abstract
Long-context Large Language Models (LLMs) face significant memory bottlenecks during inference due to the linear growth of the key-value (KV) cache with sequence length. While individual optimization techniques like KV cache quantization, chunked prefill, and model weight quantization have shown promise, their joint effects and optimal configurations for edge deployment remain underexplored. We introduce KV Pareto, a systems-level framework that systematically maps the trade-off frontier between total memory consumption and task accuracy across these three complementary optimization techniques. Our framework evaluates multiple LLM architectures (Qwen, Llama, Mistral) with varying KV quantization schemes (int2/4/8, mixed-precision), granularities (per-token, per-tensor, per-block), and 4-bit weight quantization via AWQ. It identifies model-specific Pareto-optimal configurations that achieve 68–78% total memory reduction with minimal (1–3%) accuracy degradation on long-context tasks. We further verify the selected frontiers on the Needle-in-a-Haystack, GSM8k, and MMLU benchmarks, as well as at extended context lengths of up to 128k, demonstrating the practical need for joint optimization in efficient LLM inference.
Problem

Research questions and friction points this paper is trying to address.

Optimizes KV cache and model compression for long-context LLM inference.
Explores joint effects of quantization techniques to reduce memory bottlenecks.
Identifies Pareto-optimal configurations balancing memory and accuracy trade-offs.
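The linear scaling behind the memory bottleneck is easy to make concrete. A back-of-envelope sketch (the model dimensions are illustrative, roughly 7B-class with full-width KV heads, not taken from the paper):

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Total KV cache size: keys and values (factor 2) are stored for
    every layer, KV head, and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# fp16 KV vs int4 KV at a 128k-token context
fp16 = kv_cache_bytes(128_000)                      # 2 bytes/elem
int4 = kv_cache_bytes(128_000, bytes_per_elem=0.5)  # 4-bit
print(f"fp16: {fp16 / 2**30:.1f} GiB, int4: {int4 / 2**30:.1f} GiB")
# → fp16: 62.5 GiB, int4: 15.6 GiB
```

At these (assumed) dimensions the fp16 KV cache alone dwarfs typical edge-device memory, which is why the paper combines KV quantization with weight quantization and chunked prefill rather than relying on any single technique.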
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint optimization of KV cache and weight quantization
Systems-level framework mapping memory-accuracy trade-offs
Model-specific Pareto-optimal configurations for edge deployment
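The Pareto-optimal selection over candidate configurations can be sketched generically: keep every configuration that no other configuration beats on both memory and accuracy. This is a generic dominance filter, not the paper's actual search procedure, and the example numbers are invented for illustration:

```python
def pareto_frontier(configs):
    """configs: list of (name, memory_gb, accuracy) tuples.
    Keep configs not dominated by any other, i.e. no config with
    memory <= and accuracy >= (with at least one strict inequality)."""
    frontier = []
    for name, mem, acc in configs:
        dominated = any(
            (m <= mem and a >= acc) and (m < mem or a > acc)
            for _, m, a in configs
        )
        if not dominated:
            frontier.append((name, mem, acc))
    return frontier

# Hypothetical configurations (names and numbers are illustrative)
configs = [
    ("fp16-kv",         14.0, 0.780),
    ("int8-per-token",   8.5, 0.775),
    ("int4-per-token",   5.2, 0.762),
    ("int2-per-token",   3.9, 0.610),  # cheapest, large accuracy drop
    ("int4-per-tensor",  5.2, 0.700),  # dominated by int4-per-token
]
```

Here `int4-per-tensor` is filtered out because `int4-per-token` uses the same memory at higher accuracy, mirroring the paper's observation that granularity choice matters per model.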