Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor

📅 2026-05-13

🤖 AI Summary

This work addresses long-form mathematical reasoning under very small KV-cache budgets. The authors evaluate seven compression mechanisms spanning five design families under matched mean cache and reject all seven. They then propose α, a one-function change to the TriAttention retention scorer: instead of argmax-top-k, it selects entries with a greedy, facility-location-inspired procedure that penalizes redundancy in value (V) space, controlled by a single weight λ. With λ = 0.5 on a held-out split, α clears the Bonferroni-corrected threshold in two of the four (model, budget) cells, Qwen-7B at b=128 and Llama-8B at b=64, and no cell is significantly negative. The model architecture is unchanged, and in this regime the minimal scoring change beat the heavier structural redesigns.
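
The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `greedy_diverse_select`, the use of cosine similarity between value vectors, and the "maximum similarity to any kept entry" form of the penalty are all assumptions made for illustration, since the summary only states that the penalty lives in V space and is weighted by a single λ.

```python
import numpy as np

def greedy_diverse_select(scores, values, k, lam=0.5):
    """Pick k cache slots, trading base score against V-space redundancy.

    scores: (n,) base retention scores (stand-in for the scorer's output).
    values: (n, d) value vectors for the same slots.
    lam:    redundancy weight; lam=0 reduces to plain top-k.
    The cosine-similarity penalty is an assumed form, not the paper's.
    """
    scores = np.asarray(scores, dtype=float)
    values = np.asarray(values, dtype=float)
    n = scores.shape[0]
    k = min(k, n)
    # Unit-normalise value vectors so dot products are cosine similarities.
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    unit = values / np.maximum(norms, 1e-12)
    max_sim = np.zeros(n)               # similarity to the closest kept slot
    available = np.ones(n, dtype=bool)
    kept = []
    for _ in range(k):
        gain = scores - lam * max_sim   # penalised marginal score
        gain[~available] = -np.inf      # never pick the same slot twice
        j = int(np.argmax(gain))
        kept.append(j)
        available[j] = False
        # Only positive similarity counts as redundancy (assumption).
        max_sim = np.maximum(max_sim, np.clip(unit @ unit[j], 0.0, None))
    return kept

# Toy example: slots 0 and 1 carry identical value vectors.
toy_scores = [1.0, 0.9, 0.5]
toy_values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(greedy_diverse_select(toy_scores, toy_values, k=2, lam=0.0))  # [0, 1]
print(greedy_diverse_select(toy_scores, toy_values, k=2, lam=0.5))  # [0, 2]
```

With λ = 0 the loop reproduces top-k. With λ = 0.5 the near-duplicate slot 1 loses to the lower-scored but distinct slot 2, which is the behavior a redundancy penalty is meant to produce.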

📝 Abstract

KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500~\cite{hendrycks2021math}) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill~\cite{deepseek2025r1}) at budgets $b \in \{64, 128\}$. All seven were rejected. We then propose $α$, a one-function modification to the TriAttention~\cite{mao2026triattention} retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $λ$. A pre-registered protocol tunes $λ$ on a frozen development split and confirms on a disjoint held-out split; with $λ= 0.5$, $α$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch~A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
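
To make the "clears Bonferroni" criterion concrete, the sketch below applies a Bonferroni correction to one p-value per (model, budget) cell. It rests on assumptions the abstract does not state: that the correction family is exactly the four cells, and that the family-wise level is 0.05. The p-values are placeholders chosen only to exercise the function; they are not results from the paper.

```python
def bonferroni_reject(p_values, family_alpha=0.05):
    """Return, per cell, whether its p-value clears the Bonferroni threshold.

    p_values:     dict mapping a cell label to its raw p-value.
    family_alpha: family-wise error rate (0.05 is an assumed default).
    """
    m = len(p_values)
    threshold = family_alpha / m        # per-test level after correction
    return {cell: p <= threshold for cell, p in p_values.items()}

# Placeholder p-values for illustration only (NOT the paper's numbers).
placeholder = {
    ("Qwen", 64): 0.30,
    ("Qwen", 128): 0.001,
    ("Llama", 64): 0.004,
    ("Llama", 128): 0.02,
}
decisions = bonferroni_reject(placeholder)
for cell, cleared in decisions.items():
    print(cell, "clears" if cleared else "does not clear")
```

Under these assumptions each cell is tested at 0.05 / 4 = 0.0125, so a raw p-value of 0.02 would not count as a confirmed improvement even though it is below 0.05.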

Problem

Research questions and friction points this paper is trying to address.

KV-cache compression

minimal intervention

long-form mathematical reasoning

cache budget

retention scoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-cache compression

minimal intervention

redundancy penalty