A Systematic Analysis of Hybrid Linear Attention

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Transformer models face quadratic time complexity and memory bottlenecks when processing long sequences. Linear attention mechanisms alleviate this computational overhead, but on their own they suffer from limited recall, motivating hybrid architectures that combine linear and full attention layers. This work systematically evaluates six families of linear attention variants, from vector recurrences to gating-based designs, within hybrid backbones. We train 72 models at the 340M and 1.3B parameter scales, varying the linear-to-full attention ratio. Contrary to intuition, the standalone performance of a linear component does not predict its efficacy in a hybrid setting. We identify selective gating, hierarchical recurrence, and controlled forgetting as critical design principles and recommend a linear-to-full attention ratio between 3:1 and 6:1. The resulting hybrids achieve Transformer-level recall at substantially reduced computational cost. All models are publicly released.
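
To make the fixed-size state and "controlled forgetting" ideas concrete, the sketch below shows one step of a generic gated linear-attention recurrence: the state is a single d_k × d_v matrix that is decayed by a learned gate and updated with the current key-value pair, so memory stays constant in sequence length. This is an illustrative NumPy toy, not the paper's implementation; the function name, shapes, and gating form are assumptions.

```python
# Illustrative sketch (not the paper's code): one step of a gated linear-attention
# recurrence with a fixed-size state, showing "controlled forgetting" in principle.
import numpy as np

def gated_linear_attention_step(S, q, k, v, g):
    """One recurrent step over a (d_k, d_v) state matrix S.

    S : (d_k, d_v) running state (fixed size, independent of sequence length)
    q : (d_k,) query,  k : (d_k,) key,  v : (d_v,) value
    g : (d_k,) per-dimension forget gate in (0, 1) -- the controlled forgetting
    """
    S = g[:, None] * S + np.outer(k, v)   # decay old memory, write the new key-value pair
    o = S.T @ q                           # read the state out with the query
    return S, o

# Toy usage: memory cost stays constant no matter how long the sequence is.
d_k, d_v, T = 8, 8, 100
S = np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(T):
    q, k, v = rng.normal(size=d_k), rng.normal(size=d_k), rng.normal(size=d_v)
    g = 1.0 / (1.0 + np.exp(-rng.normal(size=d_k)))   # sigmoid gate per state dimension
    S, o = gated_linear_attention_step(S, q, k, v, g)
```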

📝 Abstract
Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations, from vector recurrences to advanced gating mechanisms, both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that superior standalone linear models do not necessarily excel in hybrids. While language modeling remains stable across linear-to-full attention ratios, recall significantly improves with increased full attention layers, particularly below a 3:1 ratio. Our study highlights selective gating, hierarchical recurrence, and controlled forgetting as critical for effective hybrid models. We recommend architectures such as HGRN-2 or GatedDeltaNet with a linear-to-full ratio between 3:1 and 6:1 to achieve Transformer-level recall efficiently. Our models are open-sourced at https://huggingface.co/collections/m-a-p/hybrid-linear-attention-research-686c488a63d609d2f20e2b1e.
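
The recommended linear-to-full ratio describes how the two layer types are interleaved in the stack. The hypothetical helper below (not part of the released models) shows one simple way to lay out such a pattern; the function name and the strict round-robin interleaving are assumptions for illustration.

```python
# Hypothetical helper (not from the released models): build an interleaving pattern
# for a given linear-to-full attention ratio, e.g. 3:1 means three linear-attention
# layers for every full-attention layer.
def hybrid_layer_pattern(n_layers: int, linear: int = 3, full: int = 1) -> list[str]:
    block = ["linear"] * linear + ["full"] * full
    return [block[i % len(block)] for i in range(n_layers)]

# A 24-layer model at the recommended 3:1 ratio places a full-attention layer
# after every three linear-attention layers (6 full-attention layers in total).
print(hybrid_layer_pattern(24, linear=3, full=1))
```
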
Problem

Research questions and friction points this paper is trying to address.

Evaluating linear attention models in hybrid architectures
Assessing recall performance versus full attention layers
Identifying optimal linear-to-full attention ratios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid linear and full attention architectures
Systematic evaluation of six linear attention variants
Open-sourced 72 models for comprehensive analysis