AI Summary
Softmax-based full attention incurs quadratic computational complexity O(n²), while sliding window attention (SWA) achieves linear complexity O(n) but suffers from effectively unbounded memory updates; Softmax attention, in turn, normalizes its updates, inducing memory contraction and gradient vanishing. To address these issues, we propose GatedFWA, a gated window attention mechanism that unifies dynamic memory updating via learnable per-token and per-head decay biases. It preserves linear time complexity while ensuring numerical stability and healthy gradient flow. GatedFWA integrates a gated memory recurrence, a FlashAttention-compatible kernel, and single-pass gate precomputation, enabling I/O-efficient execution under sliding masks. Experiments demonstrate that GatedFWA enables high-throughput autoregressive training with near-zero overhead, significantly improves global context utilization, and natively supports compression techniques such as NSA. It is broadly applicable to diverse autoregressive modeling tasks.
Abstract
Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an *Associative Memory* interpretation, its difference-style update renders the training objective effectively *unbounded*. In contrast, Softmax attention normalizes updates, leading to *memory shrinkage and gradient vanishing*. We propose GatedFWA: a Memory-**G**ated (**F**lash) **W**indowed **A**ttention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing step and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context; it also integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.
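To make the mechanism concrete, the following is a minimal, non-fused NumPy sketch of the idea described above: a per-token gate is accumulated once into a prefix sum of log-gates (the "one-pass gate preprocessing"), and the pairwise difference of these accumulated values is added to the sliding-window attention logits as a decay bias before the softmax. All names (`gated_window_attention`, the sigmoid gate parameterization, the single-head layout) are illustrative assumptions, not the paper's actual kernel, which operates in fused FlashAttention-style blocks.

```python
import numpy as np

def gated_window_attention(q, k, v, gate_logits, window):
    """Reference (non-fused) sketch of gated windowed attention, one head.

    q, k, v: (n, d) arrays; gate_logits: (n,) raw per-token gate scores.
    Each token's gate g_t = sigmoid(gate_logits[t]) lies in (0, 1), so
    log g_t <= 0. The prefix sum G[t] = sum_{s<=t} log g_s is computed
    once; the decay bias for query i attending to key j is G[i] - G[j],
    which is more negative for older keys -- a learnable contraction.
    """
    n, d = q.shape
    log_gate = -np.log1p(np.exp(-gate_logits))  # log sigmoid, numerically stable
    G = np.cumsum(log_gate)                      # single-pass gate preprocessing
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)              # causal sliding window
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        scores = scores + (G[i] - G[lo:i + 1])   # inject gate as decay bias
        w = np.exp(scores - scores.max())        # stable softmax over the window
        w /= w.sum()
        out[i] = w @ v[lo:i + 1]
    return out
```

In a fused kernel the bias `G[i] - G[j]` is added inside the blockwise logit computation, so the only extra I/O over plain sliding-window FlashAttention is streaming the precomputed vector `G`.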