CAOTE: KV Caching through Attention Output Error based Token Eviction

📅 2025-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the excessive memory and compute overhead of KV caching during long-context inference of large language models on resource-constrained devices, this paper proposes a token eviction mechanism grounded in attention output reconstruction error. Unlike conventional approaches that rely solely on attention scores, the method explicitly incorporates value vector information into importance estimation, guiding eviction decisions by quantifying each cached token’s actual contribution to the final attention output. The mechanism is a lightweight, plug-and-play meta-heuristic that can be combined with any existing eviction strategy and is applied post-training. Experiments across multiple downstream tasks show accuracy improvements in long-context inference, supporting the effectiveness and generalizability of value-vector-aware importance modeling for KV cache compression.

📝 Abstract

While long-context support has extended the abilities of large language models, it also incurs memory and compute challenges that become crucial bottlenecks on resource-restricted devices. Token eviction, a widely adopted post-training methodology designed to alleviate these bottlenecks by evicting less important tokens from the cache, typically uses attention scores as a proxy metric for token importance. However, one major limitation of the attention score as a token-wise importance metric is that it carries no information about the contribution of a token to the attention output. In this paper, we propose a simple eviction criterion based on the contribution of cached tokens to attention outputs. Our method, CAOTE, optimizes for the eviction error, i.e. the error in the attention output caused by token eviction, by seamlessly integrating attention scores and value vectors. This is the first method that uses value vector information on top of attention-based eviction scores. Additionally, CAOTE can act as a meta-heuristic that can be used flexibly with any token eviction method. We show that CAOTE, when combined with state-of-the-art attention score-based methods, always improves accuracy on downstream tasks, indicating the importance of leveraging information from values during the token eviction process.
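To make the idea of scoring tokens by their effect on the attention output concrete, here is a minimal sketch. It is not the paper's implementation. It assumes a single query and a single head, and it assumes the remaining attention weights are renormalized to sum to 1 after a token is evicted. Under those assumptions, the change in the attention output from evicting token j has the closed form a_j / (1 - a_j) * ||v_j - o||, where a_j is the token's attention weight, v_j its value vector, and o the full attention output. The function names and the toy numbers are hypothetical.

```python
import math

def attention_output(weights, values):
    """Weighted sum of value vectors (single query, single head)."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def eviction_errors(weights, values):
    """Closed-form output error for evicting each token on its own.

    Assumes the surviving weights are renormalized after eviction.
    """
    out = attention_output(weights, values)
    return [w / (1.0 - w) * l2_distance(v, out) for w, v in zip(weights, values)]

def eviction_error_bruteforce(weights, values, j):
    """Same quantity, computed by actually dropping token j and renormalizing."""
    full = attention_output(weights, values)
    kept_w = [w / (1.0 - weights[j]) for i, w in enumerate(weights) if i != j]
    kept_v = [v for i, v in enumerate(values) if i != j]
    return l2_distance(full, attention_output(kept_w, kept_v))

# Toy cache: three tokens share the same value, the fourth is an outlier
# that receives the least attention.
weights = [0.4, 0.3, 0.2, 0.1]
values = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 10.0]]

errors = eviction_errors(weights, values)  # roughly [0.67, 0.43, 0.25, 1.00]
evict_by_attention = min(range(len(weights)), key=weights.__getitem__)
evict_by_error = min(range(len(errors)), key=errors.__getitem__)
print(evict_by_attention, evict_by_error)  # 3 2
```

In this toy case an attention-only criterion would evict token 3, which has the smallest weight, even though removing it changes the output the most. The error-based criterion evicts token 2 instead, because its value is already well represented by the other cached tokens.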

Problem

Research questions and friction points this paper is trying to address.

Optimizes token eviction using attention output error

Improves KV caching efficiency in resource-limited devices

Integrates attention scores and value vectors for eviction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses attention output error for token eviction

Integrates attention scores and value vectors

Acts as meta-heuristic with flexible usage

🔎 Similar Papers

D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models