AI Summary
Softmax-based full attention incurs quadratic computational complexity O(n²), while sliding window attention (SWA) achieves linear complexity O(n) but suffers from effectively unbounded memory updates; Softmax attention, in turn, normalizes its updates, inducing memory contraction and gradient vanishing. To address these issues, we propose GatedFWA, a gated window attention mechanism that unifies dynamic memory updating via learnable per-token and per-head decay biases. It preserves linear time complexity while ensuring numerical stability and healthy gradient flow. GatedFWA integrates a gated memory recurrence, a FlashAttention-compatible kernel, and single-pass gate precomputation, enabling I/O-efficient execution under sliding masks. Experiments demonstrate that GatedFWA enables high-throughput autoregressive training with near-zero overhead, significantly improves global context utilization, and natively supports compression techniques such as NSA. It is broadly applicable to diverse autoregressive modeling tasks.
Abstract
Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an *Associative Memory* interpretation, its difference-style update renders the training objective effectively *unbounded*. In contrast, Softmax attention normalizes updates, leading to *memory shrinkage and gradient vanishing*. We propose GatedFWA: a Memory-**G**ated (**F**lash) **W**indowed **A**ttention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing step and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context; it also integrates cleanly with token compression/selection methods such as NSA and generalizes to various autoregressive domains.
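To make the mechanism concrete, the following is a minimal, non-fused NumPy sketch of the idea described above: a per-token gate is accumulated once into a prefix sum of log-gates (the "one-pass gate preprocessing"), and the pairwise difference of these accumulated values is added to the sliding-window attention logits as a decay bias before the softmax. All names (`gated_window_attention`, the sigmoid gate parameterization, the single-head layout) are illustrative assumptions, not the paper's actual kernel, which operates in fused FlashAttention-style blocks.

```python
import numpy as np

def gated_window_attention(q, k, v, gate_logits, window):
    """Reference (non-fused) sketch of gated windowed attention, one head.

    q, k, v: (n, d) arrays; gate_logits: (n,) raw per-token gate scores.
    Each token's gate g_t = sigmoid(gate_logits[t]) lies in (0, 1), so
    log g_t <= 0. The prefix sum G[t] = sum_{s<=t} log g_s is computed
    once; the decay bias for query i attending to key j is G[i] - G[j],
    which is more negative for older keys -- a learnable contraction.
    """
    n, d = q.shape
    log_gate = -np.log1p(np.exp(-gate_logits))  # log sigmoid, numerically stable
    G = np.cumsum(log_gate)                      # single-pass gate preprocessing
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)              # causal sliding window
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        scores = scores + (G[i] - G[lo:i + 1])   # inject gate as decay bias
        w = np.exp(scores - scores.max())        # stable softmax over the window
        w /= w.sum()
        out[i] = w @ v[lo:i + 1]
    return out
```

In a fused kernel the bias `G[i] - G[j]` is added inside the blockwise logit computation, so the only extra I/O over plain sliding-window FlashAttention is streaming the precomputed vector `G`.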