OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

career value

207K/year

🤖 AI Summary

This work addresses the substantial memory bandwidth and storage demands of key-value (KV) caching in long-context autoregressive inference. The authors propose a data-independent, online, and deterministic quantization method that jointly maps coordinate triplets into an octahedral parameter space via structured random rotations, subsequently transforming them into a square domain. By integrating Lloyd-Max quantization with mean-squared-error-optimized non-uniform bit allocation for both direction and magnitude, the approach achieves dimension-adaptive optimal compression. Notably, it introduces triplet-wise joint quantization and octahedral parameterization for the first time. Evaluated across text, video, and audio tasks, the method consistently outperforms existing rotation-based codecs at all bitrates—particularly under extreme compression—without incurring additional inference bandwidth or latency overhead.

📝 Abstract

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm through joint quantization of rotated coordinate triplets. Each triplet's direction is mapped to a square via an octahedral parameterization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation depending only on the total dimensionality of the keys. We find the finite-dimensional quality optimum with sweeps to be constant on every real decoder we test. The codec is data-oblivious, online, and deterministic given a seed. Across text, video, and audio, OCTOPUS matches or beats every prior rotation codec at every reported bit width and metric, with a lead that grows as bits drop for extreme compression. Furthermore, a fused Triton implementation reconstructs keys on the fly without materializing the uncompressed key, so the codec adds no decode-time bandwidth or latency over the existing dequantization. Project Page: https://octopus-quant.github.io/

Problem

Research questions and friction points this paper is trying to address.

KV cache

memory bandwidth

long-context inference

transformer compression

quantization

Innovation

Methods, ideas, or system contributions that make the work stand out.

octahedral parametrization

joint quantization

KV cache compression