🤖 AI Summary
This work addresses a bottleneck in long-context inference with large language models: the KV cache grows linearly with sequence length, and existing heuristic eviction strategies both struggle to model input-dependent token importance and cause irreversible information loss. The authors propose a learnable indexer that predicts KV importance, coupled with a lightweight latent memory module that compresses evicted tokens online into a compact state. Residual readouts from this memory compensate for the attention contributions lost through eviction. Combining learned importance prediction with an online-updated latent memory improves inference accuracy and retrieval stability under a bounded KV budget. Experiments on Qwen, Mistral, and Llama models show consistent improvements over existing eviction policies on RULER (4K/16K), with gains of up to 25 points under aggressive eviction, along with more stable Needle-in-a-Haystack retrieval and better LongBench scores and compression curves.
📝 Abstract
Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long-context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, input-dependent distribution of token importance. In this work, we introduce a learnable indexer that predicts KV importance, enabling more accurate retention of critical tokens. Meanwhile, naively evicting tokens permanently discards their information, leading to irreversible forgetting and degraded retrieval over long ranges. To address this, we propose a lightweight latent memory module that compresses evicted tokens into a compact, online-updated state and provides residual readouts to compensate for the attention contributions lost through KV eviction. Collectively, our method enables accurate long-context inference under a bounded KV budget, delivering consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), markedly more stable Needle-in-a-Haystack retrieval, and superior LongBench scores and compression curves compared to existing eviction policies.
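The abstract does not specify the architecture, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes a fixed linear scorer as a stand-in for the trained indexer, a linear-attention-style running state (`S`, `z`) as the latent memory, an ELU+1 feature map `phi`, and a constant `gate` on the residual readout. All of these names and choices are hypothetical.

```python
import math
import random


def phi(x):
    # Positive feature map (ELU + 1); an assumed choice, not taken from the paper.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class BoundedKVCache:
    """Toy KV cache with a fixed budget, scored eviction, and a latent memory."""

    def __init__(self, dim, budget, scorer_weights, gate=0.5):
        self.dim = dim
        self.budget = budget
        self.w = scorer_weights  # hypothetical stand-in for a learned indexer
        self.gate = gate         # hypothetical constant; a real model might learn it
        self.keys, self.values = [], []
        # Latent memory: S accumulates phi(k) v^T, z accumulates phi(k).
        self.S = [[0.0] * dim for _ in range(dim)]
        self.z = [0.0] * dim
        self.evicted = 0

    def score(self, key):
        # Importance score of a cached key under the stand-in indexer.
        return dot(self.w, key)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        if len(self.keys) > self.budget:
            self._evict_lowest()

    def _evict_lowest(self):
        # Drop the lowest-scoring entry, but fold it into the latent memory first.
        idx = min(range(len(self.keys)), key=lambda i: self.score(self.keys[i]))
        k, v = self.keys.pop(idx), self.values.pop(idx)
        fk = phi(k)
        for i in range(self.dim):
            self.z[i] += fk[i]
            for j in range(self.dim):
                self.S[i][j] += fk[i] * v[j]
        self.evicted += 1

    def attend(self, query):
        # Softmax attention over the retained entries only.
        scale = 1.0 / math.sqrt(self.dim)
        logits = [dot(query, k) * scale for k in self.keys]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        out = [sum(w * v[j] for w, v in zip(weights, self.values)) / total
               for j in range(self.dim)]
        if self.evicted:
            # Residual readout from the compressed state of evicted tokens.
            fq = phi(query)
            denom = dot(fq, self.z)
            readout = [sum(fq[i] * self.S[i][j] for i in range(self.dim)) / denom
                       for j in range(self.dim)]
            out = [o + self.gate * r for o, r in zip(out, readout)]
        return out


random.seed(0)
dim = 4
cache = BoundedKVCache(dim, budget=8,
                       scorer_weights=[random.gauss(0, 1) for _ in range(dim)])
for _ in range(32):
    cache.append([random.gauss(0, 1) for _ in range(dim)],
                 [random.gauss(0, 1) for _ in range(dim)])
output = cache.attend([random.gauss(0, 1) for _ in range(dim)])
print(len(cache.keys), cache.evicted, output)
```

In this sketch the number of retained entries never exceeds `budget`, and the latent state is a fixed `dim × dim` matrix plus a `dim`-length vector, so its size does not grow with sequence length. How the paper trains the indexer and combines the readout with attention is not stated in the abstract and is not modelled here.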