MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

πŸ“… 2026-01-12
πŸ“ˆ Citations: 2
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Linear attention mechanisms reduce computational complexity but often suffer from diminished expressiveness and representational diversity due to global context collapse. This work is the first to identify and address this failure mode, proposing Multi-Head Linear Attention (MHLA), which partitions tokens into multiple heads and performs linear-complexity attention within each partition. MHLA restores much of the expressive capacity of softmax-based attention without incurring additional computational overhead. Experimental results demonstrate that MHLA significantly enhances performance at the same time complexity: it improves ImageNet classification accuracy by 3.6%, boosts NLP task scores by 6.3%, improves image generation quality by 12.6%, and yields a 41% improvement on video generation.

πŸ“ Abstract
While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.
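To make the core idea concrete, here is a minimal NumPy sketch of token-level multi-head linear attention as the abstract describes it: standard kernelized linear attention applied independently within token partitions ("heads"). The positive feature map, contiguous partitioning scheme, and all function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention in O(n*d^2) rather than O(n^2*d).
    q, k, v: arrays of shape (n, d). The positive feature map below is
    a simple stand-in (assumption); papers often use elu(x) + 1."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-2   # positive feature map
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                 # (d, d) global key-value summary
    z = qf @ kf.sum(axis=0)       # (n,) normalizer per query
    return (qf @ kv) / (z[:, None] + eps)

def mhla(q, k, v, num_heads=4):
    """Token-level multi-head linear attention (sketch). Tokens are split
    into `num_heads` contiguous groups along the sequence dimension
    (partitioning scheme is an assumption), and linear attention is
    computed within each group, preserving per-group context diversity."""
    n = q.shape[0]
    assert n % num_heads == 0, "sequence length must divide evenly"
    step = n // num_heads
    out = [linear_attention(q[s:s + step], k[s:s + step], v[s:s + step])
           for s in range(0, n, step)]
    return np.concatenate(out, axis=0)
```

Because each partition maintains its own key-value summary `kv` instead of one shared global summary, the representational diversity that collapses in plain linear attention is retained, while the total cost stays linear in sequence length.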
Problem

Research questions and friction points this paper is trying to address.

linear attention
expressivity
computational complexity
Transformer
global context collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear Attention
Multi-Head Attention
Global Context Collapse
Token-Level Partitioning
Efficient Transformers
πŸ”Ž Similar Papers
No similar papers found.