NOSA: Native and Offloadable Sparse Attention

📅 2025-10-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing trainable sparse attention methods cannot compress the key-value (KV) cache, leading to GPU memory bottlenecks and low decoding throughput during long-context inference with large batch sizes. This work proposes the first trainable sparse attention framework to natively support KV cache offloading. We first uncover a strong locality pattern in sparse attention during autoregressive decoding; leveraging this insight, we design a dual-branch dynamic selection mechanism, comprising query-aware and query-agnostic components, that enables efficient CPU-GPU collaborative offloading without additional training overhead. Integrated with KV chunking, explicit locality regularization, and lightweight scheduling, our approach preserves the exact attention computation used during training. Evaluated on a 1B-parameter model, our method achieves up to 2.3× higher decoding throughput than InfLLM-V2 while sustaining near-lossless accuracy.

📝 Abstract
Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).
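The locality claim in the abstract can be illustrated with a toy measurement (a hypothetical sketch, not NOSA's actual scoring: block scores here are random, with a shared component standing in for the query-agnostic importance of each KV block). When the selected block sets at adjacent decoding steps overlap heavily, most selected KV pairs are already resident on the GPU, and only the difference must cross the CPU-GPU link.

```python
import numpy as np

def selected_blocks(scores: np.ndarray, k: int) -> set[int]:
    # Indices of the k highest-scoring KV blocks.
    return set(np.argpartition(scores, -k)[-k:].tolist())

def step_overlap(scores_t: np.ndarray, scores_t1: np.ndarray, k: int) -> float:
    # Fraction of blocks selected at step t that are re-selected at step t+1.
    # High overlap means small CPU->GPU transfers per decoding step.
    a, b = selected_blocks(scores_t, k), selected_blocks(scores_t1, k)
    return len(a & b) / k

rng = np.random.default_rng(0)
base = rng.normal(size=256)                 # shared component across steps
s_t  = base + 0.1 * rng.normal(size=256)    # step-t scores: small query-dependent drift
s_t1 = base + 0.1 * rng.normal(size=256)    # step-(t+1) scores
print(f"overlap@k=32: {step_overlap(s_t, s_t1, 32):.2f}")
```

With strongly correlated scores, the overlap is high; as the query-dependent drift grows, overlap drops and transfer volume starts to dominate decoding cost, which is the inefficiency the paper targets.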
Problem

Research questions and friction points this paper is trying to address.

Reducing KV cache size to improve GPU batch processing efficiency
Minimizing CPU-GPU transfers for sparse attention in long-context decoding
Maintaining near-lossless performance while accelerating large-scale batched inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

NOSA introduces explicit locality constraints for sparse attention
It decomposes token selection into query-aware and query-agnostic components
This enables KV cache offloading while preserving training attention computation
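The decomposition above can be sketched minimally (hypothetical names and sizes; NOSA's real branches use learned scores, whereas here both are random stand-ins): a query-agnostic set is fixed once per sequence and stays resident on the GPU, while a small query-aware top-k changes per step, so only the per-step delta needs a CPU-to-GPU copy.

```python
import numpy as np

def dual_branch_select(free_scores: np.ndarray, query_scores: np.ndarray,
                       k_free: int, k_query: int) -> set[int]:
    # Query-agnostic branch: fixed high-importance blocks (scored once per
    # sequence). Query-aware branch: per-step top-k on query-dependent scores.
    agnostic = set(np.argpartition(free_scores, -k_free)[-k_free:].tolist())
    aware = set(np.argpartition(query_scores, -k_query)[-k_query:].tolist())
    return agnostic | aware

def transfer_cost(prev_on_gpu: set[int], selected: set[int]) -> int:
    # Blocks to copy CPU -> GPU this step: selected now but not yet resident.
    return len(selected - prev_on_gpu)

rng = np.random.default_rng(1)
free_scores = rng.normal(size=256)  # static query-agnostic importance
sel_prev = dual_branch_select(free_scores, rng.normal(size=256), 48, 16)
sel_next = dual_branch_select(free_scores, rng.normal(size=256), 48, 16)
print("blocks to transfer:", transfer_cost(sel_prev, sel_next))
```

Because the query-agnostic set is identical across steps, the per-step transfer is bounded by the query-aware budget (here at most 16 blocks) rather than the full selection size.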
Yuxiang Huang
Tsinghua University
Efficient AI · Natural Language Processing · Machine Learning System
Chaojun Xiao
Postdoctoral Researcher, Tsinghua University
Large Language Model
Xu Han
Department of Computer Science and Technology, Tsinghua University
Zhiyuan Liu
Department of Computer Science and Technology, Tsinghua University