Controlling changes to attention logits

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
During Transformer training, query (Q) and key (K) weight magnitudes often grow uncontrollably, causing large fluctuations in attention logits and training instability. While existing QK normalization techniques mitigate this issue, they require full instantiation of Q/K matrices during inference—rendering them incompatible with implicit-attention architectures such as Multi-Head Latent Attention (MLA). To address this, we propose Logit-Adaptive Learning Rate (LALR), a parameter-dependent dynamic learning rate scheduling mechanism that directly constrains the magnitude of logit updates in the gradient step, without explicitly constructing Q/K. LALR preserves MLA’s architectural advantages—including memory and computation efficiency—while substantially increasing tolerance to higher base learning rates. Empirically, LALR outperforms prior methods under MLA configurations and matches the stability and convergence of QK normalization in standard multi-head attention, all without architectural modification or inference overhead.

📝 Abstract
Stability of neural network weights is critical when training transformer models. The query and key weights are particularly problematic, as they tend to grow large without any intervention. Applying normalization to queries and keys, known as "QK norm", fixes stability issues in practice, but is not always applicable. For example, QK norm is not compatible with Multi Latent Attention (MLA) because QK norm requires full materialization of queries and keys during inference, which is not done in MLA. In this paper we suggest that controlling the changes to logits is important for stability. We show that these changes are controllable by assigning parameter-dependent learning rates to the query and key weights. We find that our cheap intervention allows us to increase the base learning rate of the network, outperform other methods in the MLA setting, and achieve performance competitive with QK norm when using Multi-head Attention.
Problem

Research questions and friction points this paper is trying to address.

- Addresses transformer weight instability during training by controlling attention logit changes
- Solves query/key weight growth issues without requiring QK normalization
- Enables stable training for Multi Latent Attention through parameter-dependent learning rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Assigning parameter-dependent learning rates to the query and key weights
- Controlling changes to attention logits for stability
- Increasing the base learning rate while maintaining weight stability
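The core idea above can be sketched numerically. This is a minimal illustration under an assumed scaling rule, not the paper's exact LALR formula: since the logits are bilinear in W_q and W_k, an update to W_q of size proportional to the learning rate changes the logits by an amount that also scales with the magnitude of W_k. Dividing the query-side learning rate by the norm of the key weights (and symmetrically for keys) therefore bounds the induced logit change without ever materializing Q/K norms per token.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 16
x = rng.normal(size=(8, d_model))              # a small batch of activations
W_q = rng.normal(size=(d_model, d_head)) * 0.1
W_k = rng.normal(size=(d_model, d_head)) * 0.1

def logits(W_q, W_k):
    # Standard scaled dot-product attention logits.
    q, k = x @ W_q, x @ W_k
    return q @ k.T / np.sqrt(d_head)

def scaled_step(W, grad, base_lr, other_norm):
    # Parameter-dependent learning rate (illustrative, not the paper's
    # exact rule): shrink the step when the *other* projection is large,
    # so the induced change in the logits stays roughly constant.
    return W - base_lr / max(other_norm, 1.0) * grad

g_q = rng.normal(size=W_q.shape)               # stand-in gradient for W_q
base_lr = 0.5

base = logits(W_q, W_k)
plain = logits(W_q - base_lr * g_q, W_k) - base
scaled = logits(scaled_step(W_q, g_q, base_lr, np.linalg.norm(W_k)), W_k) - base

# The scaled step induces a strictly smaller logit perturbation.
print(np.abs(plain).max(), np.abs(scaled).max())
```

Because the logit change is linear in the weight update, the rescaled step shrinks the logit perturbation by exactly the factor applied to the learning rate, which is what makes the intervention cheap: it touches only the optimizer, not the architecture.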