🤖 AI Summary
This work identifies how the softmax temperature parameter systematically induces representation collapse and a low-rank bias, degrading generalization and out-of-distribution (OOD) robustness. Addressing the lack of theoretical understanding of the interplay among temperature, representation geometry, and generalization, we introduce the concept of "rank-deficit bias" and derive an analytical relationship linking logit norm, temperature, and representation rank. We further design a temperature-driven, controllable representation-compression mechanism that achieves over 30% dimensionality reduction while preserving in-distribution classification accuracy. Through analysis of softmax dynamics, spectral-theoretic modeling, and extensive experiments across architectures (CNNs and Transformers) and datasets, our method consistently improves OOD accuracy. The work establishes a theoretical framework for interpretable representation control and provides a practical pathway toward robust training.
📝 Abstract
The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model's representations. We introduce the concept of rank-deficit bias: a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the norm of the softmax logits, which is implicitly influenced by hyperparameters or directly controlled by the softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning for improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.
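To make the two central quantities concrete, here is a minimal NumPy sketch (not the paper's code) of a temperature-scaled softmax and a simple effective-rank proxy for a representation matrix. The tolerance `tol` and the function names are illustrative choices, not taken from the paper; dividing logits by the temperature is equivalent to rescaling the logit norm that the abstract identifies as the driver of the bias.

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Temperature-scaled softmax: larger temperatures flatten the
    distribution, smaller ones sharpen it (equivalently, temperature
    rescales the effective logit norm)."""
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def effective_rank(features, tol=1e-3):
    """Count singular values above tol * (largest singular value):
    a crude proxy for the rank of an (n_samples, dim) feature matrix."""
    s = np.linalg.svd(features, compute_uv=False)
    return int((s > tol * s[0]).sum())

# Same logits, different temperatures: low T peaks, high T flattens.
logits = np.array([2.0, 1.0, 0.1])
p_sharp = softmax(logits, temperature=0.5)
p_flat = softmax(logits, temperature=5.0)
```

A representation that has collapsed onto a low-dimensional subspace would show an `effective_rank` far below both the feature dimension and the number of classes, which is the measurable signature of the rank-deficit bias described above.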