🤖 AI Summary
This work addresses the limited local Lipschitz stability of the self-attention mechanism in Transformers. We establish, for the first time, an explicit theoretical connection between the distribution of attention scores and the local Lipschitz constant, revealing that the concentration of the softmax output distribution directly governs model robustness. To this end, we propose JaSMin, a Jacobian-based softmax regularization method that leverages a closed-form expression for the spectral norm of the softmax Jacobian to directly constrain attention score distributions in the gradient domain. JaSMin significantly reduces the local Lipschitz constant, yielding consistent improvements in adversarial robustness and generalization across text classification and question answering tasks. Furthermore, empirical evaluation in security-critical settings validates the effectiveness of interpretable, distribution-level control over attention mechanisms.
📝 Abstract
We present a novel local Lipschitz bound for the self-attention blocks of Transformers. The bound is based on a refined closed-form expression for the spectral norm of the softmax Jacobian. The resulting bound is not only tighter than those in prior work, but also reveals how the Lipschitz constant depends on the attention score maps. Based on these findings, we offer an explanation, from the Lipschitz-constant perspective, of how the distributions inside the attention map affect robustness. We also introduce a new lightweight regularization term called JaSMin (Jacobian Softmax norm Minimization), which boosts the Transformer's robustness and decreases the local Lipschitz constant of the whole network.
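The quantity at the heart of the abstract can be illustrated numerically: the Jacobian of softmax at a score vector z with output s is diag(s) − s sᵀ, and its spectral norm varies with how concentrated s is. The sketch below is illustrative only; the function names are mine, and it does not reproduce the paper's refined closed-form bound or the exact JaSMin penalty.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    # Jacobian of softmax at z: J = diag(s) - s s^T (symmetric PSD).
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

def jacobian_spectral_norm(z):
    # Spectral norm = largest singular value; for this symmetric
    # PSD matrix it equals the largest eigenvalue.
    return np.linalg.norm(softmax_jacobian(z), 2)

# Uniform scores give the most spread-out softmax output;
# a large gap concentrates the mass on one entry.
uniform = np.zeros(4)                    # softmax -> [0.25, 0.25, 0.25, 0.25]
peaked = np.array([4.0, 0.0, 0.0, 0.0])  # softmax puts ~0.95 on the first entry

print(jacobian_spectral_norm(uniform))  # 0.25
print(jacobian_spectral_norm(peaked))   # noticeably smaller
```

In this toy example the highly concentrated attention distribution yields a smaller Jacobian spectral norm than the uniform one, which is the kind of dependence on the attention map's distribution that the paper's bound makes explicit; a JaSMin-style regularizer would penalize this norm (or a closed-form surrogate for it) during training.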