SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

📅 2026-04-21

🤖 AI Summary
This work addresses the large memory footprint of KV caches in large language model inference, where existing compression methods struggle to satisfy deployment constraints and preserve model accuracy at the same time. The authors propose a system-aware 4-bit KV-cache quantization scheme that combines lightweight block-diagonal Hadamard rotation with per-token INT4 quantization. They implement a fused rotation-quantization kernel that fits practical deployment requirements, including paged memory layouts and regular memory access patterns. Across multiple models and benchmarks, the method recovers nearly all of the accuracy lost by naive INT4, adds no measurable end-to-end overhead, and matches the throughput of plain INT4 quantization.


📝 Abstract
KV-cache memory is a major bottleneck in real-world LLM serving, where systems must simultaneously support latency-sensitive small-batch requests and high-throughput concurrent workloads. Although many KV-cache compression methods improve offline accuracy or compression ratio, they often violate practical serving constraints such as paged memory layouts, regular memory access, and fused attention execution, limiting their effectiveness in deployment. In this work, we identify the minimal set of 4-bit KV-cache quantization methods that remain viable under these constraints. Our central finding is that a simple design, token-wise INT4 quantization with block-diagonal Hadamard rotation, consistently achieves the best accuracy-efficiency trade-off. Across multiple models and benchmarks, this approach recovers nearly all of the accuracy lost by naive INT4, while more complex methods such as vector quantization and Hessian-aware quantization provide only marginal additional gains once serving compatibility is taken into account. To make this practical, we implement a fused rotation-quantization kernel that integrates directly into paged KV-cache layouts and introduces zero measurable end-to-end overhead, matching plain INT4 throughput across concurrency levels. Our results show that effective KV-cache compression is fundamentally a systems co-design problem: under real serving constraints, lightweight block-diagonal Hadamard rotation is a viable method that delivers near-lossless accuracy without sacrificing serving efficiency.
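
The core idea in the abstract, rotating each token with a block-diagonal Hadamard matrix before token-wise INT4 quantization, can be illustrated with a small pure-Python sketch. This is not the paper's fused kernel and says nothing about paged layouts. The function names, the block size of 32, the symmetric scaling with codes in [-8, 7], and the synthetic token with one outlier channel are all assumptions made for illustration.

```python
import math
import random


def fwht(block):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    x = list(block)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in x]


def rotate(token, block_size):
    """Block-diagonal rotation: each contiguous block is transformed independently."""
    out = []
    for start in range(0, len(token), block_size):
        out.extend(fwht(token[start:start + block_size]))
    return out


def quantize_int4(token):
    """Symmetric per-token INT4 (assumed scheme): one scale, codes in [-8, 7]."""
    amax = max(abs(v) for v in token)
    scale = amax / 7.0 if amax > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in token]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def roundtrip_error(token, block_size=None):
    """Mean squared error after quantize/dequantize, optionally with rotation."""
    x = rotate(token, block_size) if block_size else token
    codes, scale = quantize_int4(x)
    y = dequantize(codes, scale)
    if block_size:
        # The normalized Hadamard matrix is symmetric and orthogonal,
        # so applying the same rotation again inverts it.
        y = rotate(y, block_size)
    return sum((a - b) ** 2 for a, b in zip(token, y)) / len(token)


# Synthetic token: unit-variance noise plus one large outlier channel.
random.seed(0)
token = [random.gauss(0.0, 1.0) for _ in range(128)]
token[5] = 50.0

print("plain INT4 MSE:  ", roundtrip_error(token))
print("rotated INT4 MSE:", roundtrip_error(token, block_size=32))
```

In this toy setting the outlier forces a coarse per-token scale, so plain INT4 rounds most small channels to zero. The rotation spreads the outlier's energy across its block, which shrinks the maximum magnitude and therefore the quantization step. A serving implementation would fuse these steps into the cache-write path, as the abstract describes, rather than run them as separate passes.
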
Problem

Research questions and friction points this paper is trying to address.

KV-cache
quantization
LLM serving
memory bottleneck
system constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-cache quantization
INT4
Hadamard rotation
system-aware optimization
LLM serving