LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

📅 2026-03-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the linear memory growth of the KV cache during long-context reasoning with large language models, which limits scalability. Existing cache eviction methods based on importance scoring incur substantial prefill overhead because they rely on expensive draft generation. To overcome this limitation, the authors propose LookaheadKV, a lightweight, draft-free KV cache eviction framework that adds parameter-efficient modules inside Transformer layers to directly predict the true importance of cached tokens for future outputs, enabling prospective evaluation without explicit draft generation. This approach is the first to estimate KV importance accurately at minimal computational overhead, significantly outperforming existing high-cost approximation methods. Experiments demonstrate that LookaheadKV surpasses state-of-the-art baselines across diverse long-context tasks, reduces cache eviction cost by up to 14.5×, and substantially accelerates first-token generation.

๐Ÿ“ Abstract
Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV entries that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future": a draft generator produces a surrogate future response approximating the target model's true response, and this surrogate is then used to estimate the importance of cached KV entries more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefill overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that retains the benefits of a surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design incurs negligible runtime overhead, comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines on various long-context understanding tasks, but also reduces eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.
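The abstract describes eviction guided by estimated importance scores: each cached token gets a score, and only the highest-scoring entries are retained under a fixed budget. The sketch below illustrates that generic scheme using accumulated attention mass as the importance signal, in the spirit of the inexpensive heuristics the paper compares against; it is not LookaheadKV's learned predictor, and all names and shapes here are illustrative assumptions.

```python
import numpy as np

def evict_kv_by_importance(keys, values, attn_weights, budget):
    """Keep only the `budget` most important cached tokens (one head).

    keys, values:  (seq_len, head_dim) cached K/V for one attention head.
    attn_weights:  (num_queries, seq_len) attention probabilities from
                   recent queries, used here as a cheap importance signal.
    """
    # Score each cached token by the attention mass it has received.
    scores = attn_weights.sum(axis=0)              # (seq_len,)
    # Retain the top-`budget` tokens, preserving their original order.
    keep = np.sort(np.argsort(scores)[-budget:])
    return keys[keep], values[keep], keep

# Toy example: a cache of 8 tokens, compressed to a budget of 4.
rng = np.random.default_rng(0)
K = rng.standard_normal((8, 16))
V = rng.standard_normal((8, 16))
A = rng.random((3, 8))
A /= A.sum(axis=1, keepdims=True)                  # rows sum to 1
K2, V2, kept = evict_kv_by_importance(K, V, A, budget=4)
print(K2.shape, kept)                              # (4, 16) plus 4 indices
```

LookaheadKV's contribution, per the abstract, is replacing such retrospective scores (or expensive draft-based ones) with trained parameter-efficient modules that predict each token's importance to the *future* response directly.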
Problem

Research questions and friction points this paper is trying to address.

KV cache eviction
long-context inference
large language models
autoregressive generation
cache efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

LookaheadKV
KV cache eviction
draft-free lookahead
parameter-efficient prediction
long-context LLMs