Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization

📅 2025-06-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies how the softmax temperature parameter systematically induces representation collapse and a low-rank bias, degrading generalization and out-of-distribution (OOD) robustness. To address the lack of theoretical understanding of the interplay among temperature, representation geometry, and generalization, the authors introduce the concept of rank deficit bias and derive an analytical relationship linking logit norm, temperature, and representation rank. They further design a temperature-driven, controllable representation compression mechanism that achieves over 30% dimensionality reduction while preserving in-distribution classification accuracy. Through analysis of softmax dynamics, spectral-theoretic modeling, and experiments across architectures (CNNs and Transformers) and datasets, the approach consistently improves OOD accuracy. The work establishes a theoretical framework for interpretable representation control and provides a practical pathway toward robust training.
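
The mechanism the summary refers to can be recalled with a minimal sketch (illustrative only, not code from the paper): dividing the logits by a temperature T rescales their norm before softmax, so low T sharpens the output distribution and high T flattens it.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits scaled by temperature T.

    Lower T inflates the effective logit norm (sharper distribution);
    higher T shrinks it (flatter distribution).
    """
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
sharp = softmax_with_temperature(logits, T=0.5)  # more peaked
flat = softmax_with_temperature(logits, T=5.0)   # closer to uniform
```

Both outputs are valid probability distributions over the same classes; only their concentration changes, which is the knob the paper ties to representation rank.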

📝 Abstract
The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model's representation. We introduce the concept of rank deficit bias - a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the softmax function's logits norm, which is implicitly influenced by hyperparameters or directly modified by softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance their performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning in improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.
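
The rank deficit bias described above, networks settling on solutions of rank far below the number of classes, can be probed with a simple diagnostic (an illustrative sketch, not the paper's method): count the singular values of a feature matrix that are non-negligible relative to the largest one.

```python
import numpy as np

def effective_rank(H, tol=0.01):
    """Count singular values above tol * largest.

    A crude proxy for the rank of a representation matrix H
    (rows = samples, columns = feature dimensions).
    """
    s = np.linalg.svd(H, compute_uv=False)
    return int((s > tol * s[0]).sum())

rng = np.random.default_rng(0)
# Hypothetical rank-deficient features: 512 samples in 64 dimensions,
# but constructed with only 5 underlying directions.
H = rng.normal(size=(512, 5)) @ rng.normal(size=(5, 64))
```

On such a matrix the diagnostic reports a rank of 5 despite the 64-dimensional ambient space, the kind of gap between ambient dimension and effective rank that the paper links to logit norm and temperature.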
Problem

Research questions and friction points this paper is trying to address.

Understanding softmax's impact on learning dynamics and representations
Addressing rank deficit bias in softmax-based deep networks
Exploiting softmax dynamics for compressed and generalized representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the concept of rank deficit bias
Exploits softmax dynamics for controllable representation compression
Demonstrates broad applicability of temperature tuning across architectures and datasets
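
If a network's representations are already rank-deficient, they can be stored in far fewer dimensions with little information loss. The paper achieves this compression through temperature; the sketch below instead uses a PCA-style SVD truncation purely to illustrate the dimensionality reduction itself (illustrative, not the paper's mechanism).

```python
import numpy as np

def compress_features(H, keep_dims):
    """Project centered features onto their top principal directions."""
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc @ Vt[:keep_dims].T  # (n_samples, keep_dims)

rng = np.random.default_rng(1)
# Hypothetical near-rank-2 features in a 100-dimensional space.
H = rng.normal(size=(256, 2)) @ rng.normal(size=(2, 100)) \
    + 0.01 * rng.normal(size=(256, 100))
Z = compress_features(H, keep_dims=2)  # 100 -> 2 dimensions
```

Because almost all the variance lives in two directions, the 100-to-2 truncation discards little, mirroring how the paper's temperature-driven compression exceeds 30% reduction without hurting in-distribution accuracy.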
👥 Authors
Wojciech Masarczyk (Warsaw University of Technology): Continual Learning, Lifelong Learning, Deep Learning
M. Ostaszewski (Warsaw University of Technology, Poland)
Tin Sum Cheng (University of Basel): deep learning theory, optimization algorithms, molecular chemistry
Tomasz Trzciński (Warsaw University of Technology, Poland; IDEAS Research Institute, Poland; Tooploox, Poland)
Aurélien Lucchi (University of Basel, Switzerland)
Razvan Pascanu (Google DeepMind): deep learning, reinforcement learning, recurrent neural models, optimization, graph neural networks