FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

career value

211K/year

🤖 AI Summary

This work addresses the limitations of existing linear attention mechanisms, which rely on scalar learning rates and thus struggle to achieve dimension-wise adaptive control, hindering effective long-context memory modeling. To overcome this, we introduce channel-wise adaptive learning rates into the Delta rule for the first time and propose a key-value decoupling mechanism that separately modulates the strength of memory writing and erasure, enabling fine-grained regulation. Building upon the Gated Delta Network (GDN) architecture and incorporating channel-wise optimization principles inspired by AdaGrad and Adam, we develop FG²-GDN and its enhanced variant FG²-GDN+. Experiments demonstrate that our models significantly outperform both GDN and KDA on synthetic and real-world long-context tasks, achieving superior associative memory and contextual understanding capabilities while maintaining comparable computational efficiency.

Technology Category

Application Category

📝 Abstract

Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta rule, an online gradient descent update, enables superior associative recall compared to simple additive updates. While KDA refined the coarse head-wise decay gate into channel-wise decay, the learning rate $β_t$ in the delta update remains a scalar, limiting the model's capacity for dimension-specific adaptation. We introduce FG$^2$-GDN, which replaces the scalar $β_t$ with a channel-wise vector analogous to the transition from SGD to per-coordinate adaptive optimizers such as AdaGrad and Adam. We further propose FG$^2$-GDN+, which decouples the scaling for keys and values, enabling independent control of erasure strength and write strength. Experiments on synthetic and real-world benchmarks show that FG$^2$-GDN and its variant improve associative recall and long-context understanding over GDN and KDA, with comparable computational efficiency.

Problem

Research questions and friction points this paper is trying to address.

linear attention

delta rule

associative recall

long-context modeling

adaptive learning rate

Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained gating

linear attention

delta rule