HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of conventional sparse attention methods, which rely on auxiliary proxies to predict important tokens and struggle to effectively compress the key-value (KV) cache, thereby constraining computational and memory efficiency. The authors propose HySparse, a novel architecture that alternates full attention layers with sparse attention layers. The full attention layers serve as precise oracles, directly guiding token selection in subsequent sparse layers while sharing their KV cache to jointly reduce both computation and storage overhead. Without introducing additional model complexity, HySparse consistently outperforms full attention and sliding window attention (SWA) baselines across both 7B dense and 80B mixture-of-experts (MoE) models. Notably, in the 80B model, only five full attention layers achieve nearly 10× KV cache compression alongside significant performance gains.

📝 Abstract
This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer's token selection and KV caches directly from the preceding full attention layer. This architecture resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without saving KV cache. HySparse enables sparse attention layers to reuse the full attention KV cache, thereby reducing both computation and memory. We evaluate HySparse on both 7B dense and 80B MoE models. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers employ full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.
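The core mechanism, as the abstract describes it, is that a full attention layer acts as an oracle: its attention scores rank token importance for the following sparse layers, and its KV cache is shared with them. A minimal, non-causal NumPy sketch of that idea is below; the function names and the query-summed importance heuristic are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    # Full attention layer: every query attends to every key.
    # It returns its output plus the score matrix (the "oracle")
    # and its KV cache, both reused by subsequent sparse layers.
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v, scores, (k, v)

def sparse_attention(q, kv_cache, oracle_scores, top_k):
    # Sparse attention layer guided by the preceding full layer:
    # oracle scores rank token importance, and only the top-k tokens
    # from the SHARED KV cache are attended to, so no new KV is stored.
    k, v = kv_cache
    importance = oracle_scores.sum(axis=0)   # aggregate score per token (assumed heuristic)
    keep = np.sort(np.argsort(importance)[-top_k:])  # indices of retained tokens
    scores = softmax(q @ k[keep].T / np.sqrt(q.shape[-1]))
    return scores @ v[keep], keep
```

In this toy, several `sparse_attention` calls would follow one `full_attention` call, which is how the architecture amortizes one full layer's cost and cache across a block of cheap sparse layers.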
Problem

Research questions and friction points this paper is trying to address.

sparse attention
token selection
KV cache
memory efficiency
attention mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Sparse Attention
Oracle Token Selection
KV Cache Sharing
Efficient Transformer
Memory-Efficient Inference
Yizhao Gao
LLM-Core, Xiaomi
Jianyu Wei
USTC & MSRA Joint PhD
LLM Infra, Inference System, Quantization, Kernel, Co-design
Qihao Zhang
LLM-Core, Xiaomi
Yu Cheng
LLM-Core, Xiaomi
Shimao Chen
LLM-Core, Xiaomi
Zhengju Tang
Peking University
Zi-Ang Jiang
LLM-Core, Xiaomi
Yi-Hao Song
LLM-Core, Xiaomi
Hailin Zhang
LLM-Core, Xiaomi
Liang Zhao
LLM-Core, Xiaomi
Bo Yang
LLM-Core, Xiaomi
Gang Wang
LLM-Core, Xiaomi
Shijie Cao
Microsoft Research Asia
Efficient Deep Learning, Deep Learning System, Computer Architecture
Fuli Luo
LLM-Core, Xiaomi