🤖 AI Summary
Large language models (LLMs) face growing memory and bandwidth bottlenecks as their key-value cache (KV-Cache) grows linearly with context length. Existing sparsification methods either degrade generation quality by permanently evicting tokens, or retain the full cache but rely on coarse-grained page-level retrieval and inaccurate importance proxies, failing to balance efficiency and fidelity. To address this, the authors propose TokenButler: a lightweight (<1.2% parameter overhead), dynamic, query-aware, fine-grained token importance predictor that identifies critical tokens at each decoding step. TokenButler improves perplexity and downstream task accuracy by over 8% relative to state-of-the-art methods for estimating token importance, and achieves near-oracle accuracy on a novel synthetic co-referential retrieval benchmark.
📝 Abstract
Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient autoregressive decoding. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck: prior research has shown that only a small subset of tokens contributes meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and heavily query-dependent. Existing methods either risk quality degradation by evicting tokens permanently, or retain the full KV-Cache but retrieve chunks (pages) of tokens at generation time, failing on dense, context-rich tasks. Additionally, many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. By training a lightweight predictor with less than 1.2% parameter overhead, TokenButler prioritizes tokens based on their predicted contextual importance. This improves perplexity and downstream accuracy by over 8% relative to SoTA methods for estimating token importance. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy. Code, models and benchmarks: https://github.com/abdelfattah-lab/TokenButler
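To make the core idea concrete, here is a minimal sketch of query-aware sparse attention over a KV-Cache. In this toy version, each cached token's "importance" is its exact attention logit with the current query; TokenButler's contribution is a trained lightweight predictor that estimates these scores cheaply instead of computing them exactly. All names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def topk_sparse_attention(query, keys, values, k):
    """Attend to only the k most important cached tokens.

    query:  (d,)   current decoding step's query vector
    keys:   (T, d) cached key vectors, one per past token
    values: (T, d) cached value vectors
    """
    d = query.shape[-1]
    # Per-token importance: scaled dot-product logits with the current query.
    # (A learned predictor would approximate these without touching all keys.)
    logits = keys @ query / np.sqrt(d)            # shape (T,)
    top = np.argsort(logits)[-k:]                 # indices of the k critical tokens
    # Softmax over the selected subset only.
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()
    return w @ values[top]                        # shape (d,): sparse attention output

rng = np.random.default_rng(0)
T, d = 128, 16
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
query = rng.normal(size=d)
out = topk_sparse_attention(query, keys, values, k=8)
```

Because only the selected rows of `values` are read, the per-step cost drops from O(T) to O(k) once importance scores are available; the accuracy of the method then hinges on how well the predictor's scores match the true attention logits.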