OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the weak in-context associative recall of linear attention, which it attributes to the single scalar step size in existing Delta Rule updates: a scalar gate ignores how the curvature of the inner objective varies across feature dimensions. The authors propose OSDN (Online Scaled DeltaNet), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Because this right-preconditioning is algebraically equivalent to scaling the write-side key per feature dimension, OSDN preserves DeltaNet's chunkwise parallel computation without adding high-dimensional state. On the theory side, the authors establish super-geometric convergence against a right-Newton comparator and a token-local residual contraction bound. An Adaptive Preconditioner Forgetting (APF) mechanism refreshes stale calibration in non-stationary contexts. Experiments show a 32% improvement in JRT-style in-context recall over DeltaNet at 340M parameters and a 39% reduction in the recall residual ratio at 1.3B parameters, with parity on general downstream tasks such as perplexity and LongBench.

📝 Abstract

Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish super-geometric convergence against a right-Newton comparator and prove an algorithm-aligned token-local residual contraction bound. To handle non-stationary contexts, we further introduce Adaptive Preconditioner Forgetting (APF) to dynamically refresh stale calibration. Empirically, OSDN demonstrates strong performance across scales. At the 340M-parameter scale, OSDN improves JRT-style in-context recall by 32% over DeltaNet. Scaling to 1.3B parameters, it achieves a 39% reduction in the recall residual ratio while maintaining parity on general downstream tasks (e.g., perplexity and LongBench) -- demonstrating that our online-preconditioning mechanism effectively transfers and amplifies at the billion-parameter scale.
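
The equivalence between right-preconditioning and write-key scaling stated in the abstract can be checked numerically for a single token. The sketch below is illustrative only and is not the authors' implementation: it assumes a state matrix `S` of shape `(d_v, d_k)`, the inner loss `0.5 * ||S k - v||^2`, a scalar gate `beta`, and a fixed positive diagonal `d` standing in for the preconditioner. All function and variable names are hypothetical, and the hypergradient update of `d` and the APF mechanism are deliberately left out because the abstract does not specify them.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One online gradient step on 0.5 * ||S @ k - v||^2 with a scalar step size."""
    err = S @ k - v                       # prediction error for this token
    return S - beta * np.outer(err, k)    # gradient is err k^T

def preconditioned_step(S, k, v, beta, d):
    """Same step, with the gradient right-multiplied by diag(d)."""
    err = S @ k - v
    grad = np.outer(err, k)
    return S - beta * grad @ np.diag(d)

def scaled_write_key_step(S, k, v, beta, d):
    """Delta Rule step whose write-side key is scaled per feature by d."""
    err = S @ k - v                          # read side still uses the original key
    return S - beta * np.outer(err, d * k)   # write side uses d * k

# Toy sizes and random data (assumed values, not from the paper).
rng = np.random.default_rng(0)
d_k, d_v = 8, 4
S = rng.normal(size=(d_v, d_k))
k = rng.normal(size=d_k)
v = rng.normal(size=d_v)
beta = 0.1
d = rng.uniform(0.5, 2.0, size=d_k)       # stand-in diagonal preconditioner

S_pre = preconditioned_step(S, k, v, beta, d)
S_key = scaled_write_key_step(S, k, v, beta, d)
same_update = np.allclose(S_pre, S_key)

# After the step, the residual on this token shrinks by (1 - beta * k^T diag(d) k).
contraction = 1.0 - beta * np.dot(k, d * k)
residual_matches = np.allclose(S_key @ k - v, contraction * (S @ k - v))

print("preconditioned == scaled write key:", same_update)
print("residual scaled by 1 - beta * k^T D k:", residual_matches)
```

Since the preconditioner only rescales the key used in the rank-one write, the update keeps the same outer-product form as the plain Delta Rule. This is consistent with the abstract's claim that the chunkwise parallel pipeline carries over unchanged.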

Problem

Research questions and friction points this paper is trying to address.

linear attention

in-context associative recall

Delta Rule

online preconditioning

feature-wise curvature

Innovation

Methods, ideas, or system contributions that make the work stand out.

online preconditioning

linear attention

Delta Rule