MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning

📅 2025-12-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the memory consumption and latency overhead of KV caches in long chain-of-thought (CoT) reasoning, this paper proposes a query-aware mixed-precision quantization method. The core innovation is the first joint modeling of each key channel's intrinsic quantization difficulty and its relevance to the query, enabling dynamic, fine-grained retention of critical channels at high precision; the value cache, meanwhile, is quantized per token at low bit-widths. The method comprises three components: (i) a lightweight query-aware importance estimator, (ii) a channel-level mixed-precision quantization policy, and (iii) a plug-and-play deployment architecture. Evaluated across diverse complex reasoning benchmarks, the approach preserves full-precision model performance while reducing KV cache memory by over 60%, significantly outperforming existing low-bit quantization baselines.
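As a concrete illustration of component (i), the sketch below shows one plausible way to score key channels by combining intrinsic quantization difficulty with query relevance, assuming a PyTorch setting. The function name, the use of per-channel dynamic range as the difficulty signal, and the multiplicative combination rule are illustrative assumptions, not the paper's exact formulation.

    import torch

    def channel_importance(keys: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # keys:  (seq_len, head_dim) key cache for one attention head
        # query: (head_dim,) current query vector for the same head
        # returns a (head_dim,) importance score per key channel

        # Intrinsic difficulty: channels with a wide dynamic range
        # (outlier channels) lose the most accuracy at low bit-widths.
        difficulty = keys.max(dim=0).values - keys.min(dim=0).values

        # Query relevance: channels where the query has large magnitude
        # contribute most to the attention logits q @ k^T.
        relevance = query.abs()

        # Joint score; the paper's actual combination rule may differ.
        return difficulty * relevance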

📝 Abstract
Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although KV cache quantization is a promising compression technique, existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. Fixed-precision quantization struggles to handle outlier channels in the key cache, while current mixed-precision strategies fail to accurately identify components requiring high-precision representation. We find that an effective low-bit KV cache quantization strategy must consider two factors: a key channel's intrinsic quantization difficulty and its relevance to the query. Based on this insight, we propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels that need higher precision, while applying per-token quantization to the value cache. Experiments on complex reasoning datasets demonstrate that our approach significantly outperforms existing low-bit methods, achieving performance comparable to a full-precision baseline at a substantially reduced memory footprint.
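To make the channel-level policy concrete, here is a minimal sketch of how the highest-scoring key channels could be kept in full precision while the rest are quantized per channel at a low bit-width. The keep ratio, the asymmetric rounding scheme, and all names are assumptions for illustration; the code dequantizes immediately rather than packing low-bit codes, which a real cache would do.

    import torch

    def quantize_channels(x: torch.Tensor, bits: int) -> torch.Tensor:
        # Asymmetric per-channel quantization over the sequence axis.
        # x: (seq_len, head_dim); returns the dequantized tensor so the
        # quantization error is easy to inspect.
        qmax = 2 ** bits - 1
        lo = x.min(dim=0, keepdim=True).values
        hi = x.max(dim=0, keepdim=True).values
        scale = (hi - lo).clamp(min=1e-8) / qmax
        q = ((x - lo) / scale).round().clamp(0, qmax)
        return q * scale + lo

    def mixed_precision_keys(keys: torch.Tensor, scores: torch.Tensor,
                             keep_ratio: float = 0.1, bits: int = 2) -> torch.Tensor:
        # Keep the top keep_ratio of channels (ranked by importance score)
        # at full precision; quantize the remaining channels to `bits` bits.
        head_dim = keys.shape[-1]
        k = max(1, int(keep_ratio * head_dim))
        critical = scores.topk(k).indices
        out = quantize_channels(keys, bits)
        out[:, critical] = keys[:, critical]  # restore critical channels
        return out

In a real deployment the low-bit codes would be stored packed and dequantized on the fly during attention; the memory saving comes from holding most channels at a few bits instead of 16.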
Problem

Research questions and friction points this paper is trying to address.

Addresses the memory and latency overhead from the KV cache in long-context reasoning.
Overcomes the performance degradation of low-bit KV cache quantization on complex reasoning tasks.
Identifies and preserves critical key channels that require higher precision via a query-aware algorithm.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Query-aware mixed-precision quantization of the KV cache
A lightweight algorithm that identifies critical key channels
Per-token quantization for the value cache (see the sketch after this list)
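The per-token value-cache quantization named in the last bullet can be sketched as follows. The 2-bit default and the asymmetric scheme are assumptions, and a real implementation would pack the codes rather than store one uint8 per element.

    import torch

    def per_token_quantize_values(values: torch.Tensor, bits: int = 2):
        # values: (seq_len, head_dim). Each token's value vector gets its
        # own scale and zero-point (per-token granularity).
        qmax = 2 ** bits - 1
        lo = values.min(dim=-1, keepdim=True).values
        hi = values.max(dim=-1, keepdim=True).values
        scale = (hi - lo).clamp(min=1e-8) / qmax
        codes = ((values - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
        return codes, scale, lo  # cache stores codes plus per-token params

    def dequantize_values(codes: torch.Tensor, scale: torch.Tensor,
                          lo: torch.Tensor) -> torch.Tensor:
        # Reconstruct approximate values when they are read at attention time.
        return codes.float() * scale + lo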
👥 Authors
Tao Zhang
South China University of Technology, China
Ziqian Zeng
Associate Professor at South China University of Technology
Natural Language Processing
Hao Peng
Beihang University, China
Huiping Zhuang
Associate Professor, South China University of Technology
Continual Learning, Multi-Modal, Embodied AI, Large Model
Cen Chen
South China University of Technology, China; Pazhou Laboratory, China