🤖 AI Summary
This work addresses the excessive memory overhead of KV caching in large language models when processing long contexts, a challenge inadequately resolved by existing token pruning methods that struggle to balance efficiency and information retention. The authors propose FASA, a novel framework that reveals, for the first time, the existence of predictable "dominant" frequency components (FCs) within Rotary Position Embedding (RoPE) that exhibit high alignment with full attention patterns. These FCs serve as a zero-computation proxy for token importance. Leveraging this insight, FASA enables query-aware dynamic pruning, performing attention computation only on a critical subset of tokens. Experiments demonstrate that FASA achieves near-full-KV performance on benchmarks like LongBench-V1 using just 256 tokens, and attains a 2.56× speedup on AIME24 with only 18.9% cache overhead, significantly outperforming current state-of-the-art approaches.
📄 Abstract
The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key-Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short: static methods risk irreversible information loss, and dynamic strategies employ heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100\% of full-KV performance while retaining only 256 tokens, and achieves a 2.56$\times$ speedup using just 18.9\% of the cache on AIME24.
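To make the two-stage idea concrete, here is a minimal NumPy sketch, not the authors' implementation: token importance is scored with a dot product restricted to a small set of "dominant" frequency-chunk dimensions, and exact softmax attention is then computed only over the top-k tokens that proxy selects. The function name, shapes, and the choice of dominant dimensions are all illustrative assumptions.

```python
import numpy as np

def sparse_attention_via_dominant_fcs(q, K, V, dominant_dims, k_keep):
    """Hypothetical sketch of query-aware pruning via dominant FCs.

    q: (d,) current query; K, V: (n, d) cached keys/values.
    dominant_dims: indices of the assumed "dominant" FC dimensions.
    k_keep: token budget (number of KV entries actually attended to).
    """
    d = q.shape[0]
    # Stage 1: cheap proxy scores using only the dominant FC dimensions.
    proxy = K[:, dominant_dims] @ q[dominant_dims]
    # Keep the k_keep tokens the proxy ranks highest (query-aware selection).
    keep = np.argsort(proxy)[-k_keep:]
    # Stage 2: exact softmax attention restricted to the pruned subset.
    logits = (K[keep] @ q) / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[keep]

rng = np.random.default_rng(0)
n, d = 64, 16
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
out = sparse_attention_via_dominant_fcs(q, K, V, dominant_dims=np.arange(4), k_keep=8)
print(out.shape)  # (16,)
```

The point of the sketch is the cost asymmetry: the proxy touches only a few dimensions per cached token, while the full attention kernel runs on just the retained subset, which is how a small budget (e.g. 256 tokens) can approximate full-KV attention.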