Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
KV caching constitutes a critical memory bottleneck in large language model (LLM) inference; while 2-bit quantization offers substantial compression, it often incurs significant accuracy degradation—especially for long-context scenarios. This paper proposes an algorithm-system co-designed 2-bit KV quantization framework: it introduces a dynamic channel-wise precision enhancement mechanism, integrated with unified tensor decomposition and page-centric layout, enabling mask-free, non-scattered mixed-precision caching. We develop Triton-compatible lightweight dequantization kernels and a page-aligned KV layout, coupled with a low-overhead runtime pipeline that preserves memory locality and efficiency. Evaluated on Qwen3 and LLaMA3, our method achieves an 8× KV cache compression ratio with negligible accuracy loss, enables 8× larger batch sizes, and delivers 2.1–4.1× higher throughput—significantly advancing long-context inference performance.

📝 Abstract
The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit often degrades it, especially on long-context reasoning. We close this gap with Kitty, an algorithm-system co-design for mixed-precision KV caching. On the algorithm side, extensive experiments show that Dynamic Channel-wise Precision Boost -- which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision -- maintains a near-zero accuracy drop while approaching 2-bit memory. The main challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. Kitty addresses these issues by decomposing each mixed-precision Key page into two tensors with unified 2-bit precision. On top of this, Kitty provides a page-centric KV layout, Triton-compatible page dequantization kernels, and a lightweight runtime pipeline that preserves coalescing and avoids divergence. Across seven tasks and two model families (Qwen3, LLaMA3), Kitty cuts KV memory by nearly 8x with negligible accuracy loss, enabling up to 8x larger batches and 2.1x-4.1x higher throughput under the same memory budget. We release the full implementation of Kitty at https://github.com/Summer-Summer/Kitty.
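To make the idea concrete, here is a minimal NumPy sketch of channel-wise precision boosting, not the paper's implementation: channels of a Key matrix are ranked by a simple value-range proxy (the paper's actual sensitivity metric may differ), the most sensitive fraction is quantized at 4-bit, and the rest at 2-bit. All function names are hypothetical.

```python
import numpy as np

def quantize_channel(x, bits):
    """Uniform asymmetric fake-quantization of one channel to `bits` bits."""
    lo, hi = x.min(), x.max()
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo  # dequantized values

def mixed_precision_keys(K, boost_frac=0.125):
    """Rank channels by a sensitivity proxy (per-channel value range) and
    boost the top fraction to 4-bit; remaining channels stay at 2-bit."""
    n_tokens, n_channels = K.shape
    sensitivity = K.max(axis=0) - K.min(axis=0)   # proxy, not the paper's metric
    n_boost = max(1, int(boost_frac * n_channels))
    boosted = np.argsort(sensitivity)[-n_boost:]  # most sensitive channels
    boosted_set = set(boosted.tolist())
    out = np.empty_like(K)
    for c in range(n_channels):
        bits = 4 if c in boosted_set else 2
        out[:, c] = quantize_channel(K[:, c], bits)
    return out, boosted
```

Because only a small fraction of channels is boosted, the average storage stays close to 2 bits per element while the most sensitive channels retain 4-bit fidelity.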
Problem

Research questions and friction points this paper is trying to address.

Reducing KV cache memory bottleneck in LLM inference
Maintaining accuracy with 2-bit quantization for long contexts
Handling dynamic precision boosts while preserving system efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic channel-wise precision boost for KV cache
Page-centric layout with unified 2-bit precision tensors
Triton-compatible dequantization kernels preserving memory coalescing
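The "unified 2-bit precision tensors" point can be illustrated with a small sketch (an assumption about the mechanism, not the paper's code): a 4-bit boosted code can be split into a high and a low 2-bit plane, so the cache stores only uniform 2-bit tensors and the dequantization path needs no masks or scattered reads.

```python
import numpy as np

def split_to_2bit_planes(codes4):
    """Split 4-bit integer codes into high and low 2-bit planes, so storage
    and kernel access remain uniform 2-bit."""
    hi = (codes4 >> 2) & 0b11
    lo = codes4 & 0b11
    return hi, lo

def merge_2bit_planes(hi, lo):
    """Reassemble the 4-bit codes from the two planes at dequantization time."""
    return (hi << 2) | lo
```

Since both planes have the same 2-bit layout as ordinary channels, a single dequantization kernel can process every page without divergence.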