LASER: Attention with Exponential Transformation

πŸ“… 2024-11-05
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Softmax-based dot-product attention in Transformers can suffer from small gradient magnitudes during backpropagation, leading to slow updates of the query/key parameters and inefficient training. To address this, the authors propose LASER attention, a differentiable scheme that applies an exponential transformation around standard attention in place of plain softmax attention. Through gradient analysis, they show that LASER admits a larger gradient signal propagated to the upstream query/key parameters. LASER requires no architectural modifications and integrates into existing Transformer variants with only small implementation changes. Empirically, a 2.2B-parameter autoregressive LLM improves by ~1% on average (up to 3.38%) on downstream evaluations; relative improvements elsewhere are 4.67% in ViT top-1 accuracy on ImageNet, 2.25% in Conformer word error rate on LibriSpeech, and 0.93% in BERT's fraction of incorrect predictions. The result is a stronger gradient signal at essentially no deployment cost.

πŸ“ Abstract
Transformers have had a tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in the Transformer's performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor gradient signal backpropagation can lead to inefficient learning of parameters preceding the attention operations. To this end, we introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 2.2 billion parameters, where we show up to 3.38% and an average of ~1% improvement over standard attention on downstream evaluations. Using LASER gives the following relative improvements in generalization performance across a variety of tasks (vision, text and speech): 4.67% accuracy in Vision Transformer (ViT) on ImageNet, 2.25% error rate in Conformer on the LibriSpeech speech-to-text task and 0.93% fraction of incorrect predictions in BERT with 2.2 billion parameters.
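The small-gradient observation in the abstract can be illustrated numerically. The Jacobian of softmax is diag(p) − pp^T, so its entries scale like p_i(1 − p_i) and shrink toward zero as the attention distribution saturates on one key. A minimal sketch (not code from the paper):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_jacobian(p):
    # Jacobian of softmax at probability vector p: diag(p) - p p^T.
    return np.diag(p) - np.outer(p, p)

# Diffuse attention: logits are close together, gradients are healthy.
diffuse = softmax(np.array([0.1, 0.2, 0.0, -0.1]))
# Peaked attention: one logit dominates, gradients nearly vanish.
peaked = softmax(np.array([8.0, 0.2, 0.0, -0.1]))

print(np.abs(softmax_jacobian(diffuse)).max())  # ≈ 0.21
print(np.abs(softmax_jacobian(peaked)).max())   # ≈ 0.001
```

Any upstream gradient (e.g. into the query/key projections) is multiplied by this Jacobian, so saturated attention rows pass back almost no signal.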
Problem

Research questions and friction points this paper is trying to address.

Improving gradient signal in Transformer attention mechanisms
Enhancing learning efficiency in large language models
Generalizing performance across vision, text, and speech tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

LASER replaces softmax with exponential transformation
Enhances gradient signal in attention mechanism
Improves performance across vision, text, speech tasks
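The page does not reproduce the paper's exact formulation, but the "exponential transformation" idea can be sketched as a log-weighted-sum-exp wrapped around standard attention: attend over exp(V) instead of V, then take the log of the result, with a max-subtraction for numerical stability. The sketch below is illustrative only; function and variable names are my own, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def laser_attention(Q, K, V):
    # Standard scaled dot-product attention weights (unchanged).
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    # Exponential transformation: attend over exp(V), return the log
    # of the result. Subtracting max(V) first keeps exp() from
    # overflowing (the usual log-sum-exp stabilization).
    m = V.max(axis=0, keepdims=True)
    return m + np.log(A @ np.exp(V - m))
```

Because the softmax weights themselves are untouched, this drops into an existing attention implementation with only a few extra element-wise operations, consistent with the paper's claim of small implementation changes.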
πŸ”Ž Similar Papers
No similar papers found.