🤖 AI Summary
This work studies the dynamic evolution of attention mechanisms in Transformers through a novel discrete-time attention model, *localmax dynamics*, which lies between softmax and hardmax. It relaxes hardmax in a controlled way via a time-varying alignment-sensitivity parameter that modulates neighborhood influence. Methodologically, the paper first defines *quiescent sets* to characterize the system's invariant behavior, then combines dynamical-systems theory, convex geometry, and Lyapunov function analysis to establish asymptotic convergence guarantees under decaying, non-vanishing, and fully time-varying parameters. Theoretically, it proves that the convex hull of the token states converges to a convex polytope whose structure cannot be fully characterized by the maximal alignment set, and that finite-time convergence is unattainable. This framework advances the understanding of asymmetric, time-varying attention dynamics and naturally recovers hardmax as a limiting case.
📝 Abstract
We introduce a new discrete-time attention model, termed the localmax dynamics, which interpolates between the classic softmax dynamics and the hardmax dynamics, in which only the tokens that maximize the influence on a given token receive positive weight. As in hardmax, uniform weights are determined by a parameter controlling neighbor influence, but the key extension relaxes neighborhood interactions through an alignment-sensitivity parameter, which allows controlled deviations from pure hardmax behavior. As we prove, while the convex hull of the token states still converges to a convex polytope, its structure can no longer be fully described by a maximal alignment set; this prompts the introduction of quiescent sets to capture the invariant behavior of tokens near vertices. We show that these sets play a key role in understanding the asymptotic behavior of the system, even under time-varying alignment-sensitivity parameters. We further show that localmax dynamics does not exhibit finite-time convergence, and we provide results for vanishing, nonzero, and time-varying alignment-sensitivity parameters, recovering the limiting behavior of hardmax as a by-product. Finally, we adapt Lyapunov-based methods from classical opinion dynamics, highlighting their limitations in the asymmetric setting of localmax interactions and outlining directions for future research.
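To make the interpolation concrete, here is a minimal numerical sketch of one localmax update step. The specific influence function (plain inner products), the step size `alpha`, and the update rule below are illustrative assumptions, not the paper's exact formulation; the key idea shown is that each token averages uniformly over the tokens whose influence lies within an alignment-sensitivity threshold `eps` of the maximum, so that `eps = 0` recovers hardmax behavior.

```python
import numpy as np

def localmax_step(X, alpha=0.5, eps=0.1):
    """One discrete-time localmax update (illustrative sketch).

    X     : (n, d) array of token states.
    alpha : step size pulling each token toward its neighborhood average
            (hypothetical parameter, not from the paper).
    eps   : alignment-sensitivity parameter; eps = 0 keeps only exact
            maximizers of the influence, as in hardmax.
    """
    X_new = X.copy()
    for i in range(X.shape[0]):
        # Influence of every token j on token i (plain inner products here;
        # the paper's actual influence function may differ).
        scores = X @ X[i]
        # Localmax neighborhood: tokens within eps of the maximal influence.
        nbrs = np.flatnonzero(scores >= scores.max() - eps)
        # Uniform weights over the neighborhood, as in hardmax.
        target = X[nbrs].mean(axis=0)
        X_new[i] = (1 - alpha) * X[i] + alpha * target
    return X_new

X = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
X1 = localmax_step(X, alpha=0.5, eps=0.0)   # hardmax-like: singleton neighborhoods
X2 = localmax_step(X, alpha=0.5, eps=10.0)  # large eps: every token included
```

Iterating such a step contracts the token states, consistent with the convex hull converging to a polytope; varying `eps` over time mimics the time-varying alignment-sensitivity parameter studied in the paper.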