GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory

πŸ“… 2025-12-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Softmax-based full attention incurs quadratic computational complexity O(n²), while sliding window attention (SWA) achieves linear complexity O(n) but suffers from unbounded memory updates; softmax attention, conversely, normalizes its updates, inducing memory shrinkage and gradient vanishing. To address these issues, we propose GatedFWA, a gated window attention mechanism that unifies dynamic memory updating via learnable per-token, per-head decay biases. It preserves linear time complexity while ensuring numerical stability and healthy gradient flow. GatedFWA integrates gated memory recursion, a FlashAttention-compatible kernel, and single-pass gate precomputation, enabling I/O-efficient execution under sliding masks. Experiments demonstrate that GatedFWA enables high-throughput autoregressive training with near-zero overhead, significantly improves global context utilization, and natively supports compression techniques such as NSA. It is broadly applicable to diverse autoregressive modeling tasks.

πŸ“ Abstract
Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an *Associative Memory* interpretation, its difference-style update renders the training objective effectively *unbounded*. In contrast, Softmax attention normalizes updates, leading to *memory shrinkage and gradient vanishing*. We propose GatedFWA: a Memory-Gated (Flash) Windowed Attention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing step and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context, and it integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.
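To make the core idea concrete, here is a minimal NumPy sketch of the mechanism the abstract describes: a per-token gate is accumulated (in log space) into a decay bias that is added to the attention logits inside the sliding window. This is an illustrative reading of the abstract, not the paper's fused kernel; the function name, the sigmoid gate parameterization, and the single-head layout are assumptions.

```python
import numpy as np

def gated_window_attention(q, k, v, gate_logits, window):
    """Hypothetical sketch of gated sliding-window attention.

    q, k, v: (n, d) arrays for a single head.
    gate_logits: (n,) raw per-token gate scores (sigmoid-gated here
    by assumption). Older keys inside the window receive a larger
    (more negative) decay bias, acting as a learnable contraction.
    """
    n, d = q.shape
    # One-pass gate preprocessing: cumulative log-gates
    # G[i] = sum_{t<=i} log(sigmoid(gate_logits[t])), so G is
    # non-increasing and G[i] - G[j] <= 0 for j < i.
    log_g = -np.log1p(np.exp(-gate_logits))  # log(sigmoid(x)), stable
    G = np.cumsum(log_g)
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)  # sliding-window mask
        logits = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        # Inject decay bias into the logits: key j is discounted by
        # the gates accumulated between positions j and i.
        logits = logits + (G[i] - G[lo:i + 1])
        w = np.exp(logits - logits.max())  # numerically stable softmax
        w /= w.sum()
        out[i] = w @ v[lo:i + 1]
    return out
```

With all gates saturated (sigmoid near 1) the bias vanishes and this reduces to plain sliding-window attention; smaller gates shrink the effective memory of the window, which is the "learnable contraction" reading of the recurrence.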
Problem

Research questions and friction points this paper is trying to address.

Linear-time attention with bounded memory updates
Stabilizing gradient flow in sliding window attention
Efficient global context integration in autoregressive models
Innovation

Methods, ideas, or system contributions that make the work stand out.

GatedFWA uses gated associative memory for stable updates
It implements fused gate preprocessing and FlashAttention-compatible kernel
The method integrates with compression techniques for efficiency
πŸ”Ž Similar Papers
No similar papers found.
Jiaxu Liu
University of Liverpool
Efficient AI · Reinforcement Learning · Geometric Deep Learning
Yuhe Bai
Sorbonne University
Christos-Savvas Bouganis
Imperial College London