Self-Adjust Softmax

📅 2025-02-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the gradient vanishing problem of softmax under extreme attention scores in Transformers, this paper proposes Self-Adjust Softmax (SA-Softmax). Its core innovation is an input-dependent linear weighting mechanism, which theoretical analysis shows improves the lower bound of the gradient and enables drop-in replacement of the softmax in standard attention modules. The method is lightweight and theoretically grounded: it introduces only a minimal scaling adjustment with no additional computational overhead. Extensive experiments across diverse datasets, language tasks, and positional encoding schemes demonstrate that SA-Softmax consistently accelerates convergence and improves final performance on models with up to 2.7B parameters. It effectively mitigates gradient vanishing while leaving inference latency unchanged, with no increase in computational cost during deployment.

📝 Abstract
The softmax function is crucial in Transformer attention: it normalizes each row of the attention scores to sum to one, achieving superior performance over alternative functions. However, the softmax function can face a gradient vanishing issue when some attention scores approach extreme values, such as probabilities close to one or zero. In this paper, we propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying $\mathrm{softmax}(x)$ to $x \cdot \mathrm{softmax}(x)$, along with its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$. We theoretically show that SA-Softmax provides enhanced gradient properties compared to the vanilla softmax function. Moreover, SA-Softmax attention can be seamlessly integrated into the attention mechanisms of existing Transformer models with minor adjustments. We conducted experiments evaluating the empirical performance of Transformer models using SA-Softmax against the vanilla softmax function. These experiments, involving models with up to 2.7 billion parameters, span diverse datasets, language tasks, and positional encoding methods.
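The two formulas in the abstract translate directly into code. A minimal NumPy sketch of both variants (function names and the `eps` guard are my own, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along `axis`.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def sa_softmax(x, axis=-1):
    # First variant from the abstract: x * softmax(x).
    return x * softmax(x, axis=axis)

def sa_softmax_normalized(x, axis=-1, eps=1e-12):
    # Normalized variant: rescale x into [0, 1] via
    # (x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)),
    # then multiply by softmax(x). `eps` guards the degenerate
    # all-zero row (my addition, not in the paper).
    lo = np.minimum(np.min(x, axis=axis, keepdims=True), 0.0)
    hi = np.maximum(np.max(x, axis=axis, keepdims=True), 0.0)
    scaled = (x - lo) / (hi - lo + eps)
    return scaled * softmax(x, axis=axis)
```

The normalized variant keeps the linear weight inside [0, 1], so the output stays bounded like a probability while still depending linearly on the input.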
Problem

Research questions and friction points this paper is trying to address.

Address gradient vanishing in softmax
Enhance Transformer attention mechanisms
Integrate SA-Softmax with minor adjustments
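To make the first friction point concrete, here is a small numeric illustration of my own, using the standard softmax Jacobian whose diagonal entries are p_i * (1 - p_i):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax for a 1-D score vector.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Extreme attention scores: one logit dominates the row.
x = np.array([20.0, 0.0, 0.0])
p = softmax(x)

# Diagonal of the softmax Jacobian is p_i * (1 - p_i); as the
# probabilities saturate toward 0 or 1, every entry collapses
# toward zero, so little gradient flows back through the scores.
grad_diag = p * (1.0 - p)
```

With the dominant probability within 1e-8 of one, every diagonal gradient entry drops below 1e-6, which is the vanishing behavior SA-Softmax is designed to counteract.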
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modified softmax function
Enhanced gradient properties
Seamless Transformer integration
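"Seamless integration" here can be read as a drop-in replacement of softmax inside scaled dot-product attention. A hedged sketch under my own simplifications (single head, no masking, NumPy):

```python
import numpy as np

def sa_softmax(x, axis=-1):
    # x * softmax(x), the abstract's first variant.
    z = x - np.max(x, axis=axis, keepdims=True)
    p = np.exp(z)
    p = p / np.sum(p, axis=axis, keepdims=True)
    return x * p

def sa_attention(q, k, v):
    # Standard scaled dot-product attention with softmax
    # swapped for SA-Softmax; everything else is unchanged.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return sa_softmax(scores, axis=-1) @ v
```

Because only the row-wise normalization changes, the surrounding projections, masking, and multi-head plumbing of an existing Transformer are untouched.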