🤖 AI Summary
This work identifies how the softmax temperature parameter systematically induces representation collapse and a low-rank bias, degrading generalization and out-of-distribution (OOD) robustness. Addressing the lack of theoretical understanding of the interplay among temperature, representation geometry, and generalization, we introduce the concept of "rank-deficit bias" and derive an analytical relationship linking logit norm, temperature, and representation rank. We further design a temperature-driven, controllable representation-compression mechanism that achieves over 30% dimensionality reduction while preserving in-distribution classification accuracy. Through analysis of softmax dynamics, spectral-theoretic modeling, and extensive experiments across architectures (CNNs and Transformers) and datasets, our method consistently improves OOD accuracy. The work establishes a theoretical framework for interpretable representation control and provides a practical pathway toward robust training.
📝 Abstract
The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model's representations. We introduce the concept of rank-deficit bias: a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the norm of the softmax logits, which is implicitly influenced by hyperparameters or directly controlled by the softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning for improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.
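To make the two central quantities concrete, here is a minimal NumPy sketch (not the paper's code) of a temperature-scaled softmax and a simple effective-rank proxy for a representation matrix. The tolerance `tol` and the function names are illustrative choices, not taken from the paper; dividing logits by the temperature is equivalent to rescaling the logit norm that the abstract identifies as the driver of the bias.

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Temperature-scaled softmax: larger temperatures flatten the
    distribution, smaller ones sharpen it (equivalently, temperature
    rescales the effective logit norm)."""
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def effective_rank(features, tol=1e-3):
    """Count singular values above tol * (largest singular value):
    a crude proxy for the rank of an (n_samples, dim) feature matrix."""
    s = np.linalg.svd(features, compute_uv=False)
    return int((s > tol * s[0]).sum())

# Same logits, different temperatures: low T peaks, high T flattens.
logits = np.array([2.0, 1.0, 0.1])
p_sharp = softmax(logits, temperature=0.5)
p_flat = softmax(logits, temperature=5.0)
```

A representation that has collapsed onto a low-dimensional subspace would show an `effective_rank` far below both the feature dimension and the number of classes, which is the measurable signature of the rank-deficit bias described above.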