Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the large memory footprint of the KV cache in large language model inference, where naive low-bit integer quantization can degrade quality severely and stronger baselines rely on calibration. The authors propose HQMQ, a calibration-free KV cache compression method that treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the product of an element of the 24-element Hurwitz group and an entry of a small per-(layer, head) codebook of random unit quaternions. Because quaternion left-multiplication is an isometry of the 3-sphere, randomly seeded codebooks suffice, and a per-batch median-multiplier outlier extraction step handles outlier-heavy models. At approximately 5 bits the method stays within 0.02–0.10 perplexity points of FP16, and it Pareto-dominates naive integer quantization by 3× to 1900× at matched bit widths. It delivers up to 5.05× KV compression, shrinking a Llama-3-70B 128k-context KV cache from 43 GB to 8.5 GB.
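The 43 GB and 8.5 GB figures can be sanity-checked with a short back-of-the-envelope calculation. The sketch below assumes the publicly documented Llama-3-70B shape (80 layers, 8 KV heads, head dimension 128), fp16 storage, and a 128k context of 131,072 tokens; none of these values are stated in the summary itself, and the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    """Size of a KV cache in bytes: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


# Assumed model shape (not given in the summary): Llama-3-70B with GQA.
fp16_bytes = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=128 * 1024)
fp16_gb = fp16_bytes / 1e9           # decimal gigabytes

# Apply the reported best-case compression ratio of 5.05x.
compressed_gb = fp16_gb / 5.05

print(f"fp16 KV cache:  {fp16_gb:.1f} GB")
print(f"compressed:     {compressed_gb:.1f} GB")
```

Under these assumptions the fp16 cache comes out at roughly 43 GB and the compressed cache at roughly 8.5 GB, consistent with the numbers reported.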

📝 Abstract

We propose \textbf{Hurwitz Quaternion Multiplicative Quantization (HQMQ)}, a \textbf{calibration-free} method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the \emph{product} $q_p \cdot q_s$, where $q_p$ ranges over the 24-element Hurwitz group $2T$ (the 24 vertices of the 24-cell on $S^3$, nearest-neighbor angle $60^\circ$) and $q_s$ ranges over a per-(layer, head) secondary codebook of $S$ \emph{random} unit quaternions. The multiplicative composition yields $24S$ effective codewords at $S$ stored parameters; random initialization suffices because left-multiplication is an $S^3$ isometry, so seeded codebooks vary in end-task perplexity (ppl) by $<1.5\%$. A per-batch median-multiplier outlier extraction step ($C{=}3$, no calibration) handles modern outlier-heavy architectures. We evaluate on five modern open models: Mistral-7B (dense MHA), Llama-3-8B, Qwen2.5-7B, and Qwen3-8B (dense GQA), and gpt-oss-20b (sparse MoE). On Mistral-7B and Qwen3-8B, HQMQ matches fp16 within $0.02$--$0.03$ ppl points at $\sim$5 bits. On Qwen2.5-7B and Qwen3-8B, where naive int4 collapses to $10^4{+}$ ppl, HQMQ + Med3$\times$ recovers fp16 quality within $0.02$--$0.10$ ppl points at $\sim$5 bits. HQMQ Pareto-dominates naive integer quantization by $3$--$1900\times$ at matched bits across all five models, and downstream zero-shot accuracy matches fp16 at $3.79$ bits on Mistral. Against the strongest calibrated KV-quantization baseline, HQMQ at $3.79$ bits matches KIVI-4 ($\sim 4.5$ bits) within ${\sim}1$ pt on CoQA, $0.6$ pts on TruthfulQA, and $2.3$ pts on GSM8K, at $16\%$ fewer bits and without a calibration pass. At the storage level, HQMQ delivers up to $5.05\times$ KV compression, shrinking a Llama-3-70B 128k-context cache from 43 GB to 8.5 GB.
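The directional quantization described in the abstract can be illustrated with a small NumPy sketch. It builds the 24 Hurwitz units, draws $S$ random unit quaternions, forms the $24S$ products $q_p \cdot q_s$, and snaps each 4-element chunk to the nearest product by cosine similarity. This is a simplified reading of the abstract, not the authors' implementation: the function names are hypothetical, chunk norms are kept in full precision, the secondary codebook is sampled from a normalized Gaussian, and the outlier extraction step and the paper's bit accounting are omitted.

```python
import itertools
import numpy as np


def qmul(a, b):
    """Hamilton product of quaternions stored as (..., 4) arrays in (w, x, y, z) order."""
    a0, a1, a2, a3 = np.moveaxis(a, -1, 0)
    b0, b1, b2, b3 = np.moveaxis(b, -1, 0)
    return np.stack([
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    ], axis=-1)


def hurwitz_units():
    """The 24 unit Hurwitz quaternions: 8 signed axis units plus 16 of the form (±1 ± i ± j ± k)/2."""
    units = []
    for axis in range(4):
        for sign in (1.0, -1.0):
            q = np.zeros(4)
            q[axis] = sign
            units.append(q)
    for halves in itertools.product((0.5, -0.5), repeat=4):
        units.append(np.array(halves))
    return np.stack(units)  # shape (24, 4)


def build_codebook(num_secondary, seed):
    """All products q_p * q_s: 24 * S effective codewords from S stored random quaternions."""
    rng = np.random.default_rng(seed)
    q_s = rng.standard_normal((num_secondary, 4))
    q_s /= np.linalg.norm(q_s, axis=1, keepdims=True)  # project onto the unit 3-sphere
    q_p = hurwitz_units()
    return qmul(q_p[:, None, :], q_s[None, :, :]).reshape(-1, 4)


def quantize(x, codebook, eps=1e-12):
    """Split x into 4-element chunks; return nearest-codeword indices and chunk norms."""
    chunks = x.reshape(-1, 4)
    norms = np.linalg.norm(chunks, axis=1)
    directions = chunks / np.maximum(norms, eps)[:, None]
    idx = np.argmax(directions @ codebook.T, axis=1)  # max cosine = nearest point on the sphere
    return idx, norms


def dequantize(idx, norms, codebook, shape):
    """Rebuild the tensor from codeword indices and (here unquantized) norms."""
    return (codebook[idx] * norms[:, None]).reshape(shape)


# Toy usage: 8 vectors of head dimension 128 standing in for one head's keys.
rng = np.random.default_rng(0)
k = rng.standard_normal((8, 128))
codebook = build_codebook(num_secondary=16, seed=0)
idx, norms = quantize(k, codebook)
k_hat = dequantize(idx, norms, codebook, k.shape)
```

Every copy of the 24-cell rotated by a unit quaternion still covers the sphere to within $45^\circ$, so each chunk in this sketch is reconstructed with cosine similarity of at least about $0.707$ regardless of the seed; adding more secondary quaternions tightens the typical error further.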

Problem

Research questions and friction points this paper is trying to address.

KV cache compression

large language models

quantization

memory efficiency

calibration-free

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hurwitz quaternions

multiplicative quantization

calibration-free