Enhancing Linear Attention with Residual Learning

📅 2025-09-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Linear attention achieves linear time and memory complexity, but its expressivity is limited: prevalent variants reduce to a "historical prediction + single-token correction" scheme, which hinders the capture of long-range dependencies. To address this, we propose Residual Linear Attention (RLA), a framework that pairs the base linear-attention prediction with an explicit residual-fitting mechanism. RLA maintains an auxiliary recurrent state that accumulates residual errors over time and uses them to correct the base prediction. We also instantiate a delta-rule version, Residual Delta Net (RDN), which adds adaptive gating and residual clipping for finer correction control and training stability, without giving up linear complexity. Across language modeling and recall-intensive evaluations, RLA and RDN consistently outperform their respective baselines and other modern linear-attention methods, narrowing the gap to standard Transformers while retaining linear scaling.

📝 Abstract

Linear attention offers a linear-time alternative to self-attention but often struggles to capture long-range patterns. We revisit linear attention through a prediction-correction lens and show that prevalent variants can be written as a combination of a historical prediction and a single-token correction, which creates an expressivity bottleneck. To address this bottleneck, we introduce Residual Linear Attention (RLA), a framework that equips linear attention with an explicit residual-fitting mechanism. RLA maintains an auxiliary recurrent state that learns to accumulate residual errors over time and correct the base prediction. We further instantiate a delta-rule version, Residual Delta Net (RDN), incorporating adaptive gating and residual clipping for enhanced correction control and stability. Our implementation leverages highly optimized linear attention kernels and preserves linear time and memory. Across language modeling and recall-intensive evaluations, RLA and RDN consistently outperform their respective baselines and other modern linear-attention methods, narrowing the gap to standard Transformers while retaining linear scaling.
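
The auxiliary-state idea described in the abstract can be illustrated with a toy recurrence. This is only a minimal sketch under stated assumptions, not the paper's formulation: the function name `residual_linear_attention`, the definition of the residual as the value minus the base state's prediction for the current key, and the additive combination of the two read-outs are all hypothetical choices made for illustration.

```python
import numpy as np

def residual_linear_attention(q, k, v):
    """Toy recurrent sketch. q, k: (T, d_k); v: (T, d_v). Returns (T, d_v)."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))  # base linear-attention state
    R = np.zeros((d_v, d_k))  # auxiliary residual state (assumed form)
    out = np.zeros((T, d_v))
    for t in range(T):
        # Assumed residual: what the base state fails to predict for this key.
        r = v[t] - S @ k[t]
        # Standard (unnormalized) linear-attention update of the base state.
        S = S + np.outer(v[t], k[t])
        # Accumulate the residual in the auxiliary state.
        R = R + np.outer(r, k[t])
        # Base prediction plus residual correction (assumed additive).
        out[t] = S @ q[t] + R @ q[t]
    return out
```

Each step touches only the two fixed-size states, so the cost of the loop grows linearly with the sequence length T, which is the property the abstract says is preserved.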

Problem

Research questions and friction points this paper is trying to address.

Linear attention struggles to capture long-range patterns effectively

Existing linear attention creates an expressivity bottleneck in predictions

Existing variants correct the prediction with a single token only, with no mechanism for accumulating residual errors over time
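
The "historical prediction plus single-token correction" view named above can be made concrete for plain, unnormalized linear attention. The notation here is an assumption for illustration (state S_t, query q_t, key k_t, value v_t); the source does not give its own symbols.

```latex
\begin{align}
S_t &= S_{t-1} + v_t k_t^{\top} \\
o_t &= S_t q_t
     = \underbrace{S_{t-1} q_t}_{\text{historical prediction}}
     + \underbrace{(k_t^{\top} q_t)\, v_t}_{\text{single-token correction}}
\end{align}
```

In this form the correction term depends on the current token alone, which is one way to read the expressivity bottleneck the abstract describes.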

Innovation

Methods, ideas, or system contributions that make the work stand out.

Residual Linear Attention with explicit residual-fitting mechanism

Auxiliary recurrent state accumulating residual errors over time

Delta-rule version with adaptive gating and residual clipping
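
A delta-rule variant with gating and clipping might look like the sketch below. It is a hedged illustration rather than the RDN definition: the standard delta-rule update is used for the base state, while the gate names `beta` and `gamma`, the symmetric `clip` threshold, and the way the clipped residual feeds the auxiliary state are hypothetical choices not taken from the source.

```python
import numpy as np

def residual_delta_sketch(q, k, v, beta, gamma, clip=1.0):
    """Toy sketch. q, k: (T, d_k); v: (T, d_v); beta, gamma: (T,) gates."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))  # base state, updated with the delta rule
    R = np.zeros((d_v, d_k))  # auxiliary residual state (assumed form)
    out = np.zeros((T, d_v))
    for t in range(T):
        err = v[t] - S @ k[t]            # base prediction error for this key
        r = np.clip(err, -clip, clip)    # residual clipping (assumed symmetric)
        # Delta rule: move the base state's prediction for k[t] toward v[t].
        S = S + beta[t] * np.outer(err, k[t])
        # Gated delta-style update of the residual state toward the clipped error.
        R = R + gamma[t] * np.outer(r - R @ k[t], k[t])
        out[t] = S @ q[t] + R @ q[t]
    return out
```

In this sketch the gates are passed in as plain arrays; in a trained model they would presumably be computed from the input, which is what makes the gating adaptive.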
