Lost in Backpropagation: The LM Head is a Gradient Bottleneck

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical optimization bottleneck in language model output layers caused by severe gradient compression due to dimensional mismatch. During backpropagation, 95–99% of the gradient norm is suppressed by low-rank projection, distorting parameter update directions and impeding effective learning. Extending the softmax bottleneck beyond representational capacity to the optimization domain, this study reveals that the language model head constitutes a pervasive gradient bottleneck. Through theoretical analysis, empirical gradient norm measurements, and controlled pretraining experiments, the authors systematically quantify how this bottleneck constrains gradient flow and learning capability. Results demonstrate that the issue prevents models from learning even simple patterns and substantially distorts training dynamics, exposing a fundamental flaw in conventional head architectures and underscoring the need for novel designs.

📝 Abstract
The last layer of neural language models (LMs) projects output features of dimension $D$ to logits in dimension $V$, the size of the vocabulary, where usually $D \ll V$. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating $V$-dimensional gradients through a rank-$D$ linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable, and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.
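The compression mechanism the abstract describes can be illustrated with a small numerical sketch (not the paper's code; all dimensions and the weight initialization are hypothetical toy choices). Since dL/dh = Wᵀg, only the component of the V-dimensional softmax gradient g that lies in the rank-D column space of the head weight W can influence the feature gradient; the rest of its norm is suppressed.

```python
import numpy as np

# Toy sketch of the gradient bottleneck: how much of the V-dimensional
# softmax gradient survives backprojection through a rank-D LM head?
# D, V, and the 1/sqrt(D) scaling are illustrative assumptions.
rng = np.random.default_rng(0)
D, V = 64, 4096                                # hidden size D << vocab size V

W = rng.standard_normal((V, D)) / np.sqrt(D)   # LM head weight
h = rng.standard_normal(D)                     # output feature
z = W @ h                                      # logits (V-dimensional)
p = np.exp(z - z.max()); p /= p.sum()          # softmax probabilities
y = rng.integers(V)                            # target token index
g = p.copy(); g[y] -= 1.0                      # dL/dz = p - onehot(y)

# Only the component of g inside the D-dimensional column space of W
# reaches the feature gradient dL/dh = W^T g.
Q, _ = np.linalg.qr(W)                         # orthonormal basis of col(W)
g_retained = Q @ (Q.T @ g)                     # projection of g onto col(W)
suppressed = 1.0 - np.linalg.norm(g_retained) / np.linalg.norm(g)
print(f"fraction of gradient norm suppressed: {suppressed:.3f}")
```

With a random head and D/V = 64/4096, most of the gradient norm is lost in the projection, which is the qualitative effect the paper quantifies at 95–99% in trained models.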
Problem

Research questions and friction points this paper is trying to address.

softmax bottleneck
gradient bottleneck
language model head
backpropagation
optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

gradient bottleneck
softmax bottleneck
language model head
backpropagation compression
training dynamics