EntmaxKV: Support-Aware Decoding for Entmax Attention

📅 2026-05-20

🤖 AI Summary
This work addresses the KV-cache memory traffic that bottlenecks long-context decoding. Existing sparse decoding methods target softmax attention, whose dense tails mean that any truncation discards nonzero probability mass. The authors propose EntmaxKV, a framework that brings the exact sparsity of entmax attention into the decoding phase. By combining query-aware page scoring, support-aware candidate selection, and a Gaussian-aware estimate of the entmax threshold, EntmaxKV adapts its sparse budget to the score distribution and loads only the selected KV pages. When the selected candidates contain the entmax support, sparse decoding is exact rather than approximate. At matched KV budgets, EntmaxKV drops less probability mass, retains more support tokens, and yields lower output error than softmax-based sparse decoding. On long-context and language modeling benchmarks it closely matches full-cache entmax while using a small fraction of the KV cache, with speedups at 1M context length of up to 5.43× over full entmax attention and up to 3.36× over full softmax attention.
📝 Abstract
Long-context decoding is increasingly limited by KV-cache memory traffic, since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, $\alpha$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass $\delta$, showing that output error is controlled by $\delta$ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep-spin/entmaxkv.
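The support-recovery claim in the abstract can be illustrated with a small NumPy sketch. It is not the EntmaxKV implementation. Every concrete choice here is an assumption made for illustration: $\alpha = 1.5$, the bisection solver `entmax_bisect`, random Gaussian queries, keys, and values, and a token-level top-k candidate set whose budget is chosen with oracle knowledge of the support size. The paper's page-level scoring and Gaussian-aware selector are not reproduced. The sketch compares the dropped probability mass and the output error of truncated entmax and truncated softmax on the same candidates.

```python
import numpy as np


def entmax_bisect(scores, alpha=1.5, n_iter=60):
    """alpha-entmax of a 1-D score vector, found by bisection on the threshold tau.

    Uses the closed form p_i = max((alpha - 1) * s_i - tau, 0) ** (1 / (alpha - 1)),
    with tau chosen so that the probabilities sum to 1 (requires alpha > 1).
    """
    z = (alpha - 1.0) * np.asarray(scores, dtype=np.float64)
    expo = 1.0 / (alpha - 1.0)
    lo, hi = z.max() - 1.0, z.max()  # total mass is >= 1 at lo and 0 at hi
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        if np.sum(np.clip(z - tau, 0.0, None) ** expo) >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** expo
    return p / p.sum()


def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()


# Toy single-head decoding step (all sizes and data are illustrative).
rng = np.random.default_rng(0)
n, d = 512, 16
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
scores = K @ q / np.sqrt(d)

# Full-cache entmax attention and its support (tokens with nonzero weight).
p_full = entmax_bisect(scores)
support = np.flatnonzero(p_full > 0)
out_full = p_full @ V

# Hypothetical candidate selector: top-k tokens by score, with a budget
# slightly larger than the true support size (oracle knowledge, for illustration).
k = len(support) + 8
candidates = np.argsort(scores)[-k:]

# Sparse entmax over the candidates only.
out_sparse = entmax_bisect(scores[candidates]) @ V[candidates]
delta_entmax = 1.0 - p_full[candidates].sum()  # dropped entmax mass
err_entmax = np.linalg.norm(out_full - out_sparse)

# The same candidates under softmax: the dense tail is cut off.
p_soft = softmax(scores)
delta_soft = 1.0 - p_soft[candidates].sum()  # dropped softmax mass
err_soft = np.linalg.norm(p_soft @ V - softmax(scores[candidates]) @ V[candidates])

print(f"support size: {len(support)} of {n} tokens, budget k = {k}")
print(f"entmax : delta = {delta_entmax:.2e}, output error = {err_entmax:.2e}")
print(f"softmax: delta = {delta_soft:.2e}, output error = {err_soft:.2e}")
```

Because every token outside the support lies at or below the entmax threshold, removing those tokens leaves the threshold unchanged, so the sparse entmax output agrees with the full-cache output up to floating-point error. Softmax assigns positive weight to every token, so the same truncation always drops some mass.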
Problem

Research questions and friction points this paper is trying to address.

long-context decoding
KV-cache memory traffic
sparse decoding
entmax attention
probability mass truncation
Innovation

Methods, ideas, or system contributions that make the work stand out.

entmax attention
sparse decoding
KV-cache efficiency
support-aware selection
long-context generation