MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning

📅 2025-12-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the memory consumption and latency overhead of KV caches in long chain-of-thought (CoT) reasoning, this paper proposes a query-aware mixed-precision quantization method. The core innovation is the first joint modeling of each key channel's intrinsic quantization difficulty and its relevance to the query, enabling dynamic, fine-grained retention of critical channels at high precision; the value cache, meanwhile, is quantized per token at low bit-widths. The method comprises three components: (i) a lightweight query-aware importance estimator, (ii) a channel-level mixed-precision quantization policy, and (iii) a plug-and-play deployment architecture. Evaluated across diverse complex reasoning benchmarks, the approach preserves full-precision model performance while reducing KV cache memory by over 60%, significantly outperforming existing low-bit quantization baselines.
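As a concrete illustration of component (i), the sketch below shows one plausible way to score key channels by combining intrinsic quantization difficulty with query relevance, assuming a PyTorch setting. The function name, the use of per-channel dynamic range as the difficulty signal, and the multiplicative combination rule are illustrative assumptions, not the paper's exact formulation.

    import torch

    def channel_importance(keys: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # keys:  (seq_len, head_dim) key cache for one attention head
        # query: (head_dim,) current query vector for the same head
        # returns a (head_dim,) importance score per key channel

        # Intrinsic difficulty: channels with a wide dynamic range
        # (outlier channels) lose the most accuracy at low bit-widths.
        difficulty = keys.max(dim=0).values - keys.min(dim=0).values

        # Query relevance: channels where the query has large magnitude
        # contribute most to the attention logits q @ k^T.
        relevance = query.abs()

        # Joint score; the paper's actual combination rule may differ.
        return difficulty * relevance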

📝 Abstract
Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although KV cache quantization is a promising compression technique, existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. Fixed-precision quantization struggles to handle outlier channels in the key cache, while current mixed-precision strategies fail to accurately identify components requiring high-precision representation. We find that an effective low-bit KV cache quantization strategy must consider two factors: a key channel's intrinsic quantization difficulty and its relevance to the query. Based on this insight, we propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels that need higher precision, while applying per-token quantization to the value cache. Experiments on complex reasoning datasets demonstrate that our approach significantly outperforms existing low-bit methods, achieving performance comparable to a full-precision baseline at a substantially reduced memory footprint.
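To make the channel-level policy concrete, here is a minimal sketch of how the highest-scoring key channels could be kept in full precision while the rest are quantized per channel at a low bit-width. The keep ratio, the asymmetric rounding scheme, and all names are assumptions for illustration; the code dequantizes immediately rather than packing low-bit codes, which a real cache would do.

    import torch

    def quantize_channels(x: torch.Tensor, bits: int) -> torch.Tensor:
        # Asymmetric per-channel quantization over the sequence axis.
        # x: (seq_len, head_dim); returns the dequantized tensor so the
        # quantization error is easy to inspect.
        qmax = 2 ** bits - 1
        lo = x.min(dim=0, keepdim=True).values
        hi = x.max(dim=0, keepdim=True).values
        scale = (hi - lo).clamp(min=1e-8) / qmax
        q = ((x - lo) / scale).round().clamp(0, qmax)
        return q * scale + lo

    def mixed_precision_keys(keys: torch.Tensor, scores: torch.Tensor,
                             keep_ratio: float = 0.1, bits: int = 2) -> torch.Tensor:
        # Keep the top keep_ratio of channels (ranked by importance score)
        # at full precision; quantize the remaining channels to `bits` bits.
        head_dim = keys.shape[-1]
        k = max(1, int(keep_ratio * head_dim))
        critical = scores.topk(k).indices
        out = quantize_channels(keys, bits)
        out[:, critical] = keys[:, critical]  # restore critical channels
        return out

In a real deployment the low-bit codes would be stored packed and dequantized on the fly during attention; the memory saving comes from holding most channels at a few bits instead of 16.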
Problem

Research questions and friction points this paper is trying to address.

Addresses the memory and latency overhead from the KV cache in long-context reasoning.
Overcomes the performance degradation of low-bit KV cache quantization on complex reasoning tasks.
Identifies and preserves critical key channels that require higher precision via a query-aware algorithm.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Query-aware mixed-precision quantization of the KV cache
A lightweight algorithm that identifies critical key channels
Per-token quantization for the value cache (see the sketch after this list)
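The per-token value-cache quantization named in the last bullet can be sketched as follows. The 2-bit default and the asymmetric scheme are assumptions, and a real implementation would pack the codes rather than store one uint8 per element.

    import torch

    def per_token_quantize_values(values: torch.Tensor, bits: int = 2):
        # values: (seq_len, head_dim). Each token's value vector gets its
        # own scale and zero-point (per-token granularity).
        qmax = 2 ** bits - 1
        lo = values.min(dim=-1, keepdim=True).values
        hi = values.max(dim=-1, keepdim=True).values
        scale = (hi - lo).clamp(min=1e-8) / qmax
        codes = ((values - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
        return codes, scale, lo  # cache stores codes plus per-token params

    def dequantize_values(codes: torch.Tensor, scale: torch.Tensor,
                          lo: torch.Tensor) -> torch.Tensor:
        # Reconstruct approximate values when they are read at attention time.
        return codes.float() * scale + lo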
👥 Authors
Tao Zhang
South China University of Technology, China
Ziqian Zeng
Associate Professor at South China University of Technology
Natural Language Processing
Hao Peng
Beihang University, China
Huiping Zhuang
Associate Professor, South China University of Technology
Continual Learning, Multi-Modal, Embodied AI, Large Model
Cen Chen
South China University of Technology, China; Pazhou Laboratory, China