🤖 AI Summary
This work investigates the intrinsic mechanism underlying the low-entropy (i.e., polarized) outputs commonly observed in Transformer models and related architectures. By constructing a value-softmax model and analyzing its dynamics under continuous-time gradient flow, the study examines the joint optimization of the learnable value matrix and attention vector within self-attention. The theoretical analysis demonstrates that, under various loss functions, including logistic and squared losses, gradient flow inherently drives the model output toward low-entropy solutions. The paper provides the first gradient-flow-based explanation for the universal polarization effect induced by the softmax structure, offering a unified theoretical account of empirical phenomena such as attention sinks and massive activations, while establishing a clear connection to the training dynamics of practical Transformer models.
📝 Abstract
Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as $L(\mathbf{V} \sigma(\mathbf{a}))$, where $\mathbf{V}$ and $\mathbf{a}$ are a learnable value matrix and attention vector, respectively. As the matrix-times-softmax-vector parameterization constitutes the core building block of self-attention, our analysis provides direct insight into the training dynamics of transformers. We reveal that gradient flow on this structure inherently drives the optimization toward solutions characterized by low-entropy outputs. We demonstrate the universality of this polarizing effect across various objectives, including the logistic and squared losses. Furthermore, we discuss the practical implications of these theoretical results, offering a formal mechanism for empirical phenomena such as attention sinks and massive activations.
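The dynamics described above can be illustrated with a minimal numerical sketch. This is not the paper's analysis, only an assumed toy setup: discrete-time gradient descent (approximating gradient flow) on a squared loss $\tfrac{1}{2}\|\mathbf{V}\sigma(\mathbf{a}) - \mathbf{y}\|^2$ with arbitrary dimensions, random targets, and uniform attention at initialization, tracking the entropy of the softmax vector over training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6                          # output dim and attention length (arbitrary)
V = 0.1 * rng.normal(size=(d, n))    # learnable value matrix
a = np.zeros(n)                      # attention logits: softmax is uniform at init
y = rng.normal(size=d)               # random regression target (assumed setup)

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

lr, steps = 0.05, 20_000
for _ in range(steps):
    s = softmax(a)
    r = V @ s - y                    # residual; gradient of 0.5 * ||V s - y||^2 w.r.t. V s
    gV = np.outer(r, s)              # dL/dV
    J = np.diag(s) - np.outer(s, s)  # softmax Jacobian ds/da
    ga = J @ (V.T @ r)               # dL/da via chain rule
    V -= lr * gV
    a -= lr * ga

H0 = entropy(np.ones(n) / n)         # max entropy log(n) at uniform init
Hf = entropy(softmax(a))
print(f"softmax entropy: init {H0:.3f} -> final {Hf:.3f}")
```

Under these assumptions the joint updates push the logits away from the uniform point, so the final softmax entropy sits below its initial maximum of $\log n$, consistent with the polarization described in the abstract; the continuous-time gradient flow corresponds to the small-step limit of this loop.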