Softmax is not Enough (for Sharp Size Generalisation)

๐Ÿ“… 2024-10-01
๐Ÿ“ˆ Citations: 18
โœจ Influential: 3
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work exposes a fundamental limitation of softmax: it cannot robustly approximate sharp functions as input size grows. The authors prove that, even for tasks as simple as maximum-key retrieval, any learned softmax-based circuit must disperse as the number of items grows at test time, undermining the common assumption that such circuits generalise sharply to arbitrarily large inputs. As a partial remedy, they propose adaptive temperature, an inference-time technique that adjusts the softmax temperature in an input-aware manner to sharpen attention. Both the theoretical analysis and the empirical evaluation indicate that the mechanism improves decision sharpness and generalisation to larger problem sizes.

๐Ÿ“ Abstract
A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions with increasing problem size, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
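The dispersion phenomenon described in the abstract can be illustrated numerically. The following is a minimal sketch (not code from the paper): with bounded logits, the probability mass softmax assigns to the largest entry shrinks as the number of items grows, so no fixed circuit can stay sharp at every input size.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [10, 100, 1000, 10000]

# Weight assigned to the largest logit as the input size grows.
# Logits are bounded (uniform in [0, 1]), mimicking bounded keys.
max_weights = []
for n in sizes:
    logits = rng.uniform(0.0, 1.0, size=n)
    max_weights.append(softmax(logits).max())

for n, w in zip(sizes, max_weights):
    print(f"n={n:5d}  weight on max entry: {w:.5f}")
```

The weight on the maximum entry decays roughly like 1/n here, which is the "dispersion" the paper proves must occur for any learned softmax circuit with bounded logits.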
Problem

Research questions and friction points this paper is trying to address.

Softmax fails to generalise sharply as problem size increases
Learned circuits disperse as the number of test items grows
Softmax is fundamentally limited in robustly approximating sharp functions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive temperature technique for sharpening softmax at inference time
Theoretical proof of softmax's dispersion limitation
Sharper differentiable query-key lookups
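The adaptive-temperature idea can be sketched as follows. This is an illustrative heuristic, not the paper's exact procedure (the function name `sharpened_softmax`, the `target_weight` parameter, and the temperature-halving search are all assumptions): since dividing the logits by a smaller temperature monotonically increases the weight on the largest entry, an input-aware search can lower the temperature at inference time until the distribution is sharp enough.

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sharpened_softmax(z, target_weight=0.9, min_temperature=1e-6):
    # Hypothetical input-aware rule (not the paper's exact method):
    # halve the temperature until the winning entry holds at least
    # `target_weight` of the probability mass, or a floor is reached.
    t = 1.0
    while softmax(z, t).max() < target_weight and t > min_temperature:
        t *= 0.5
    return softmax(z, t)

rng = np.random.default_rng(0)
logits = rng.uniform(0.0, 1.0, size=1000)
print("plain softmax, weight on max entry:", softmax(logits).max())
print("sharpened,     weight on max entry:", sharpened_softmax(logits).max())
```

The key property exploited here is that the maximum entry's weight is monotone increasing as temperature decreases, so the search always makes the distribution at least as sharp as the temperature-1 baseline.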
๐Ÿ”Ž Similar Papers
No similar papers found.