🤖 AI Summary
Linear attention in large language models suffers from performance bottlenecks due to inaccurate feature mapping, lack of normalization, and gate saturation. To address these issues, this paper proposes Fine-grained Gated Linear Attention (FGLA). Methodologically, FGLA introduces: (1) a learnable nonlinear feature mapping function to enhance sequence modeling capacity; (2) layer normalization, theoretically justified and empirically validated for improved training stability; and (3) a gating refinement module to mitigate gate saturation. The mechanism is compatible with both de novo training and post-pretraining adaptation under the linear attention paradigm. Extensive experiments across multiple benchmarks demonstrate that FGLA consistently outperforms existing gated linear attention methods, achieving superior trade-offs between computational efficiency and modeling performance.
📝 Abstract
Recent advances in Large Language Models (LLMs) have demonstrated exceptional performance on complex language modeling tasks. However, these models also incur significant computational and storage costs, primarily due to the quadratic complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic time and space complexity inherent in standard transformers. In this work, we conduct a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We develop a feature mapping function to address crucial issues that previous proposals overlooked. We then offer further rationale for integrating normalization layers to stabilize the training process. Moreover, we examine the saturation phenomenon of the gating mechanism and augment it with a refining module. Extensive experiments show that our architecture outperforms previous Gated Linear Attention mechanisms across a wide range of tasks, including training from scratch and post-linearization with continual pre-training.
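To make the three components concrete, here is a minimal NumPy sketch of a single gated linear attention pass: a positive feature map applied to queries and keys, an elementwise decay gate on the recurrent key-value state, and a normalization of each output. The `elu_plus_one` feature map and the layer-norm-style normalization are illustrative stand-ins chosen for this sketch; the paper's actual design uses a learnable feature map and a gating refinement module not reproduced here.

```python
import numpy as np

def elu_plus_one(x):
    # A common positive feature map (an illustrative choice; the paper
    # instead proposes a learnable nonlinear mapping).
    return np.where(x > 0, x + 1.0, np.exp(x))

def gated_linear_attention(Q, K, V, gates):
    """One gated linear attention pass, computed recurrently.

    Q, K, V : (T, d) arrays of queries, keys, values.
    gates   : (T, d) decay gates in (0, 1), applied elementwise to the state.
    Returns : (T, d) outputs. Cost is linear in T, unlike softmax attention.
    """
    T, d = Q.shape
    S = np.zeros((d, d))              # running key-value state
    out = np.zeros((T, d))
    for t in range(T):
        q = elu_plus_one(Q[t])
        k = elu_plus_one(K[t])
        # Decay the state elementwise, then add the new key-value outer product.
        S = gates[t][:, None] * S + np.outer(k, V[t])
        o = q @ S
        # Layer-norm-style normalization of each output for training stability.
        out[t] = (o - o.mean()) / (o.std() + 1e-6)
    return out
```

Because the state `S` is a fixed-size `(d, d)` matrix updated once per token, the whole sequence is processed in O(T·d²) time and O(d²) memory, which is the efficiency gain linear attention trades against softmax attention's quadratic cost.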