Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

📅 2025-04-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing sparse attention mechanisms, such as neighborhood attention, have struggled to deliver consistent speedups over the dense self-attention baseline, largely because of complex attention infrastructure and rapidly evolving AI hardware. This paper proposes Generalized Neighborhood Attention (GNA), a unified framework that describes sliding-window, strided sliding-window, and blocked attention patterns with multi-dimensional locality. The authors build an analytical model of the achievable performance improvements and release a simulator that produces realistic speedup upper bounds for any given configuration, along with fused GNA kernels for the NVIDIA Blackwell architecture built on a state-of-the-art CUTLASS FMHA kernel. On the B200 GPU, the implementation reaches an effective utilization of 1.3 petaFLOPs/s in FP16, fully realizing the theoretical maximum speedup in many perfectly block-sparse cases, and delivers 28–46% end-to-end speedups on off-the-shelf generative models including Cosmos-7B, with no fine-tuning required.

๐Ÿ“ Abstract
Many sparse attention mechanisms, such as Neighborhood Attention, have typically failed to consistently deliver speedup over the self-attention baseline. This is largely due to the complexity of attention infrastructure and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundation models, particularly in computer vision, are heavily bound by attention and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-head attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that GNA can deliver a 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open-source our simulator and Blackwell kernels directly through the NATTEN project.
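To make the generalization concrete: the abstract says GNA subsumes sliding-window, strided sliding-window, and blocked attention. The toy 1-D sketch below illustrates one way a single (window, stride) parameterization can cover those cases — queries in the same stride group share one window, so stride 1 recovers sliding-window neighborhood attention and stride equal to the window size recovers blocked attention. The center-and-clamp window placement is an assumption for illustration and may differ from the exact NATTEN/GNA semantics.

```python
def gna_mask(n, window, stride):
    """Toy 1-D mask for Generalized Neighborhood Attention (GNA).

    stride == 1      recovers sliding-window (neighborhood) attention;
    stride == window recovers blocked (window-partition) attention.
    Window placement is a simplified center-and-clamp rule and may
    differ from the exact kernel semantics.
    """
    assert 1 <= stride <= window <= n
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # all `stride` queries in a group share one window,
        # anchored at the group's center
        anchor = (i // stride) * stride + stride // 2
        # clamp so the window stays inside the sequence
        start = min(max(anchor - window // 2, 0), n - window)
        for j in range(start, start + window):
            mask[i][j] = True
    return mask
```

With `stride == window`, every query group attends to exactly one contiguous block, which is what makes the pattern perfectly block-sparse and cheap for a tiled kernel to skip.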
Problem

Research questions and friction points this paper is trying to address.

Delivering consistent speedups from sparse attention over the dense self-attention baseline
Reducing O(n^2) complexity in vision models via reliable sparsity
Optimizing attention mechanisms for modern AI hardware architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized Neighborhood Attention for multi-dimensional sparsity
Simulator for realistic speedup upper bounds
Optimized GNA on NVIDIA Blackwell architecture