Gated Delta Networks: Improving Mamba2 with Delta Rule

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 9
Influential: 1
📄 PDF
🤖 AI Summary
Linear Transformers suffer from weak memory control, limiting their performance in long-context modeling and retrieval tasks. To address this, the authors propose Gated DeltaNet, a parallelizable architecture whose core contribution is the gated delta rule: a memory update that combines gating for rapid memory erasure with the delta rule for precise, targeted updates. Hybrid variants further combine Gated DeltaNet layers with sliding-window attention or Mamba2 layers, preserving local sensitivity alongside global context modeling while improving training efficiency. Experiments show that Gated DeltaNet consistently outperforms Mamba2 and DeltaNet across benchmarks including language modeling, commonsense reasoning, in-context retrieval, length extrapolation, and long-context understanding, with strong generalization and high training throughput.

📝 Abstract
Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
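The gated delta rule described in the abstract can be sketched as a per-token recurrence on a matrix-valued memory: a gate α decays (erases) the old state, while the delta rule overwrites the value stored under the current key with strength β. This is a minimal illustrative sketch in numpy, not the paper's hardware-optimized parallel kernel; the function name and scalar α, β parameterization are assumptions for clarity.

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One recurrent step of the gated delta rule (illustrative sketch).

    S:     (d_k, d_v) matrix-valued memory state
    k, q:  (d_k,) key / query vectors for this token
    v:     (d_v,) value vector for this token
    alpha: scalar gate in [0, 1] -- alpha = 1 recovers the plain delta
           rule (DeltaNet); beta = 0 leaves a pure decay update, as in
           gating-only models like Mamba2
    beta:  scalar delta-rule learning rate in [0, 1]
    """
    pred = S.T @ k  # value currently associated with key k
    # Erase the old association under k, then write the new one;
    # the gate alpha additionally decays the whole previous state.
    S_new = alpha * (S - beta * np.outer(k, pred)) + beta * np.outer(k, v)
    o = S_new.T @ q  # read-out for this step
    return S_new, o
```

With α = 1 and β = 1 a fresh write is exactly retrievable by querying with the same unit-norm key, which is the "precise incremental update" behavior the delta rule contributes; α < 1 lets the model forget stale content quickly, which is what gating contributes.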
Problem

Research questions and friction points this paper is trying to address.

Linear Transformers underperform standard Transformers on in-context retrieval and long-context tasks.
Gating (adaptive memory erasure) and the delta rule (targeted memory updates) have so far been explored separately, leaving their complementarity unexploited.
Combining the two mechanisms requires a training algorithm that parallelizes efficiently on modern hardware.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated delta rule: combines adaptive gating for rapid memory erasure with delta-rule updates for precise memory modification
Hardware-efficient parallel training algorithm for the gated delta rule
Hybrid architectures interleaving Gated DeltaNet layers with sliding-window attention or Mamba2 layers
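The sliding-window attention component that the hybrids pair with Gated DeltaNet restricts each position to a fixed-size causal window, supplying local precision while the linear-recurrent layers carry global context. A minimal single-head sketch in numpy (the function name and `window` parameter are assumptions; real implementations use fused kernels):

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=4):
    """Causal sliding-window attention (illustrative sketch).

    Each position t attends only to positions [t - window + 1, t].
    Q, K: (T, d_k); V: (T, d_v). Returns (T, d_v).
    """
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)  # (T, T) attention logits
    idx = np.arange(T)
    # Causal mask restricted to the last `window` positions.
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax over the visible window.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Because the window is fixed, cost stays linear in sequence length, so the hybrid keeps the efficiency advantage of the linear-recurrent layers while regaining exact local token interactions.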