Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work resolves the long-standing question of whether the softmax function admits a uniform, tight Lipschitz constant across all ℓₚ norms (p ≥ 1). Leveraging tools from functional analysis and matrix differential calculus, the authors characterize the spectral properties of the softmax Jacobian and use a limit analysis to establish that softmax is globally 1/2-Lipschitz with respect to every ℓₚ norm (p ≥ 1). The bound is tight: equality is attained for p = 1 and p = ∞, while strict inequality holds for all p ∈ (1, ∞). This is the first derivation of a unified, tight Lipschitz constant, significantly improving on the classical loose upper bound of 1. The sharper constant directly strengthens robustness guarantees and convergence analyses for attention mechanisms in vision transformers (ViT), language models (GPT-2, Qwen3-8B), and reinforcement-learning policies. Extensive numerical experiments corroborate the theoretical findings.

📝 Abstract
The softmax function is a basic operator in machine learning and optimization, used in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-sum-exp terms. Existing robustness guarantees of learning models and convergence analysis of optimization algorithms typically consider the softmax operator to have a Lipschitz constant of $1$ with respect to the $\ell_2$ norm. In this work, we prove that the softmax function is contractive with the Lipschitz constant $1/2$, uniformly across all $\ell_p$ norms with $p \ge 1$. We also show that the local Lipschitz constant of softmax attains $1/2$ for $p = 1$ and $p = \infty$, and for $p \in (1,\infty)$, the constant remains strictly below $1/2$ and the supremum $1/2$ is achieved only in the limit. To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity. We demonstrate how the sharper constant directly improves a range of existing theoretical results on robustness and convergence. We further validate the sharpness of the $1/2$ Lipschitz constant of the softmax operator through empirical studies on attention-based architectures (ViT, GPT-2, Qwen3-8B) and on stochastic policies in reinforcement learning.
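The contraction claim in the abstract is easy to probe numerically. A minimal sketch in plain NumPy (the dimension, sample count, and logit scale here are arbitrary choices, not the paper's experimental setup) checks that the ratio ‖softmax(x) − softmax(y)‖ₚ / ‖x − y‖ₚ never exceeds 1/2 for random logit pairs:

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; this does not change the output.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
for p in (1, 2, np.inf):
    ratios = []
    for _ in range(20_000):
        x, y = rng.normal(scale=3.0, size=(2, 8))  # random 8-dim logit pairs
        num = np.linalg.norm(softmax(x) - softmax(y), ord=p)
        den = np.linalg.norm(x - y, ord=p)
        ratios.append(num / den)
    print(f"p={p}: max observed ratio {max(ratios):.4f}")  # stays below 0.5
```

Consistent with the stated tightness result, the observed ratios approach but do not reach 1/2, since equality holds only for p = 1 and p = ∞ and only in the limit.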
Problem

Research questions and friction points this paper is trying to address.

Proving softmax has uniform 1/2-Lipschitz constant across all ℓp norms
Establishing tight bounds for local Lipschitz constants in extreme norms
Demonstrating improved robustness and convergence guarantees with sharper constant
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proved the softmax Lipschitz constant is exactly 1/2
Uniform bound across all ℓp norms (p ≥ 1)
Empirical validation on transformer architectures and RL policies
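The ℓ₂ case can also be checked at the level of the Jacobian: for s = softmax(x), the standard Jacobian is J(x) = diag(s) − ssᵀ, and its spectral norm gives the local ℓ₂ Lipschitz constant. A small sketch (random dimensions and scales are my own choices for illustration) verifies that this spectral norm never exceeds 1/2:

```python
import numpy as np

def softmax_jacobian(x):
    s = np.exp(x - x.max())
    s /= s.sum()
    # Standard softmax Jacobian: J = diag(s) - s s^T.
    return np.diag(s) - np.outer(s, s)

rng = np.random.default_rng(1)
worst = max(
    np.linalg.norm(softmax_jacobian(rng.normal(scale=5.0, size=6)), ord=2)
    for _ in range(5_000)
)
print(f"max spectral norm over random logits: {worst:.4f}")  # <= 0.5
```

The value 1/2 is approached when the probability mass concentrates on two nearly tied entries (e.g. s ≈ (1/2, 1/2, 0, …)), which matches the paper's observation that the supremum is achieved only in the limit.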