FASA: Frequency-aware Sparse Attention

📅 2026-02-03
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the excessive memory overhead of KV caching in large language models when processing long contexts, a challenge inadequately resolved by existing token pruning methods that struggle to balance efficiency and information retention. The authors propose FASA, a novel framework that reveals, for the first time, the existence of predictable "dominant" frequency components (FCs) within Rotary Position Embedding (RoPE) that exhibit high alignment with full attention patterns. These FCs serve as a zero-computation proxy for token importance. Leveraging this insight, FASA enables query-aware dynamic pruning, performing attention computation only on a critical subset of tokens. Experiments demonstrate that FASA achieves near-full-KV performance on benchmarks like LongBench-V1 using just 256 tokens, and attains a 2.56× speedup on AIME24 with only 18.9% cache overhead, significantly outperforming current state-of-the-art approaches.

πŸ“ Abstract
The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key-Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short: static methods risk irreversible information loss, and dynamic strategies employ heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance while keeping only 256 tokens, and achieves a 2.56× speedup using just 18.9% of the cache on AIME24.
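The abstract does not spell out the selection mechanism, but the core idea — score cached tokens cheaply using only a small subset of key/query dimensions (the "dominant" frequency chunk), keep the top-scoring tokens, and run full attention on just that subset — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name, the `fc_dims` index set, and the dot-product proxy score are hypothetical placeholders, not the paper's actual algorithm.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fc_proxy_sparse_attention(q, K, V, fc_dims, budget):
    """Query-aware sparse attention sketch.

    q:       (d,)   current query vector
    K, V:    (n, d) cached keys / values
    fc_dims: indices of the assumed "dominant" frequency-chunk dims
    budget:  number of tokens to retain
    """
    # Proxy importance: dot product restricted to the dominant dims.
    # This touches only len(fc_dims) << d dimensions per cached token.
    proxy = K[:, fc_dims] @ q[fc_dims]            # (n,)
    keep = np.argsort(proxy)[-budget:]            # top-`budget` token indices
    # Full attention computed only over the retained subset.
    scores = (K[keep] @ q) / np.sqrt(q.shape[0])  # (budget,)
    w = softmax(scores)
    return w @ V[keep], keep
```

With a budget of 256 tokens (as in the LongBench-V1 result above), the full softmax is computed over 256 keys regardless of how long the cached context is; only the cheap proxy pass scans all n tokens.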
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
KV cache
attention sparsity
token pruning
long-context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency-aware Sparse Attention
KV cache pruning
RoPE frequency chunk
query-aware token eviction
long-context LLMs