🤖 AI Summary
This work addresses the linear growth of KV cache memory during long-context reasoning with large language models. Existing compression methods mitigate this growth but overlook prompt-dependent compression risk and functional heterogeneity across attention heads, leading to unstable performance. To this end, the authors propose CompilerKV, a risk-adaptive, head-aware KV compression framework. It employs an offline contextual bandit to learn reliability differences among attention heads and compiles them into a lightweight decision table. Prompt-level compression risk is modeled via attention entropy and local perplexity to dynamically generate retention thresholds. Notably, CompilerKV operates solely during the prefill phase without altering the model architecture. Evaluated on LongBench with a 512-token cache budget, it recovers 97.7% of FullKV performance, achieving up to a 5.2-point improvement over state-of-the-art methods.
📝 Abstract
Large Language Models (LLMs) in long-context scenarios are severely constrained by the linear growth of Key-Value (KV) cache memory. Existing KV compression methods rely either on static thresholds and attention-only heuristics or on coarse memory budget allocation. Under tight memory budgets, these methods overlook two key factors, prompt-dependent variation in compression risk and functional heterogeneity across attention heads, which destabilize token selection and lead to tail failures. To address these challenges, we propose CompilerKV, a risk-adaptive and head-aware compression framework that compiles offline experience into reusable decision tables for prefill-only deployment. CompilerKV integrates two synergistic components: (i) a Head Heterogeneity Table, learned via offline contextual bandits, which assigns head-specific reliability weights to explicitly account for functional differences across attention heads; and (ii) a Risk-Adaptive Threshold Gating mechanism that jointly models attention entropy and local perplexity, transforming prompt-level risk into deployable retention thresholds. Experiments on LongBench show that CompilerKV outperforms state-of-the-art methods under a 512-token budget, recovering 97.7% of FullKV performance while achieving up to a +5.2-point gain over the strongest competitor.
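To make the two components concrete, here is a minimal sketch of the overall idea: a per-head reliability weight modulates how many tokens each head keeps, and a prompt-level risk score (attention entropy plus local perplexity) is mapped to a retention threshold. The function names, the linear risk combination, and the coefficients `alpha`/`beta` are illustrative assumptions, not the paper's actual formulas.

```python
import math

def attention_entropy(attn):
    """Shannon entropy of one head's attention distribution.

    Higher entropy = more diffuse attention = riskier to compress.
    """
    return -sum(p * math.log(p) for p in attn if p > 0.0)

def retention_threshold(entropy, local_ppl, base_keep=0.1, alpha=0.05, beta=0.02):
    """Map prompt-level risk signals to a keep-fraction in (0, 1].

    alpha/beta are illustrative coefficients (not from the paper):
    riskier prompts (diffuse attention, high perplexity) keep more tokens.
    """
    risk = alpha * entropy + beta * math.log(local_ppl)
    return min(1.0, base_keep + max(0.0, risk))

def compress_head(scores, head_weight, keep_frac):
    """Keep the top-scoring token positions for one attention head.

    head_weight plays the role of a reliability weight from a precomputed
    decision table: less reliable heads get a smaller effective budget.
    """
    k = max(1, min(len(scores), round(len(scores) * keep_frac * head_weight)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])  # keep positions in sequence order
```

This runs entirely at prefill time: the thresholds and kept positions are decided once from the prompt, so decoding proceeds unchanged on the pruned cache.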