🤖 AI Summary
This work addresses the bottleneck in long-context inference with large language models, where the KV cache grows linearly with sequence length. Existing Top-k sparse attention methods struggle to estimate cache importance both accurately and cheaply, whether they are used training-free or with sparsity-aware training. To overcome this, the authors propose UNIQUE, a unified Top-k sparse attention framework that covers both settings. UNIQUE scores importance at the level of KV pages, using the mean of each page's keys as a representative vector and their standard deviation as an offset term. For sparsity-aware training, it applies a sigmoid soft mask centered on the per-query top-k score boundary, which narrows the training-inference gap without architectural modifications or auxiliary losses. Experiments on text and speech LLMs show that UNIQUE preserves task performance on LongBench Pro and on long-form speech recognition, with up to 11.4× attention-kernel speedup over FlashInfer dense attention and at least 5.3× end-to-end decoding speedup over a vLLM-based dense model.
📝 Abstract
Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but accurately and cheaply estimating cache importance, for both training-free use and sparsity-aware training, remains challenging. This paper proposes UNIQUE, a universal top-k sparse attention framework that addresses both requirements and stays consistently effective across LLM modalities. UNIQUE operates at the granularity of KV pages and estimates per-page importance with a simple yet accurate score combining the mean of the page's keys as a representative vector with their standard deviation as an offset term. To further close the train-inference gap, this paper introduces a soft-mask sparsity-aware training scheme that uses the top-k score boundary as a per-query threshold and a sigmoid soft mask around it, requiring neither auxiliary losses nor architectural changes. Experiments on text and speech LLMs show that UNIQUE preserves task performance on long-context benchmarks such as LongBench Pro and on long-form speech recognition, while delivering up to 11.4x attention-kernel speedup over FlashInfer dense attention and at least 5.3x end-to-end decoding speedup over a vLLM-based dense model.
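The abstract does not give the exact scoring formula or mask parameters, so the following is only a minimal sketch of the two ideas it describes, written in plain Python. It assumes a page score of the form q · mean(K_page) + Σ_d |q_d| · std_d(K_page), and a sigmoid mask whose sharpness is set by a `temperature` parameter. Both the formula and the names `page_scores`, `topk_pages`, `soft_mask`, and `temperature` are hypothetical illustrations, not the authors' implementation.

```python
import math
from statistics import fmean, pstdev


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def page_scores(query, keys, page_size):
    """Score each KV page for a single query vector.

    ASSUMED form (not taken from the paper):
        score = q . mean(K_page) + sum_d |q_d| * std_d(K_page)
    The mean acts as the page's representative key and the
    standard deviation contributes a non-negative offset.
    """
    scores = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        columns = list(zip(*page))  # one tuple per key dimension
        mean = [fmean(col) for col in columns]
        std = [pstdev(col) for col in columns]
        representative = sum(q * m for q, m in zip(query, mean))
        offset = sum(abs(q) * s for q, s in zip(query, std))
        scores.append(representative + offset)
    return scores


def topk_pages(scores, k):
    """Indices of the k highest-scoring pages, in ascending page order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def soft_mask(scores, k, temperature=1.0):
    """Sigmoid soft mask centered on the k-th largest score.

    The k-th largest score serves as the per-query threshold;
    `temperature` is a hypothetical sharpness parameter.
    """
    threshold = sorted(scores, reverse=True)[k - 1]
    return [_sigmoid((s - threshold) / temperature) for s in scores]
```

In this sketch, inference would load only the pages returned by `topk_pages`, while training would weight pages by `soft_mask`. The page sitting exactly on the top-k boundary receives a mask value of 0.5, and lowering `temperature` pushes the soft mask toward the hard top-k selection used at inference time.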