Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses suboptimal matrix-layer norms in language models, which arise from the interplay between weight decay and gradient noise and constrain model performance. To overcome this, the authors propose learnable scalar, row-wise, and column-wise multipliers as a more expressive generalization of muP multipliers, removing the implicit constraint on matrix scale during training and letting the model adaptively learn the optimal scale from data. The method is validated end-to-end with both the Adam and Muon optimizers; beyond circumventing the fixed-scale limitation, it surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. On downstream evaluations, the approach outperforms a well-tuned muP baseline, with gains comparable to those achieved by switching from Adam to Muon.

📝 Abstract
Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both the Adam and Muon optimizers, where they yield downstream-evaluation improvements matching those of switching from Adam to Muon.
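To make the abstract's setup concrete, the sketch below shows one plausible reading of the parameterization: an effective weight W_eff[i][j] = s · r[i] · W[i][j] · c[j], combining a learnable scalar s with per-row multipliers r and per-column multipliers c. This is an illustrative assumption, not the authors' code; the names `forward`, `s`, `r`, `c` are hypothetical. It also demonstrates the kind of forward-pass symmetry the paper mentions: rescaling the row multipliers while inversely rescaling the scalar leaves the output unchanged.

```python
# Illustrative sketch (not the authors' implementation) of a linear layer
# with learnable scalar, per-row, and per-column multipliers:
#   y_i = s * r_i * sum_j W[i][j] * c[j] * x[j]

def forward(W, s, r, c, x):
    """Apply the effective weight s * diag(r) @ W @ diag(c) to input x."""
    return [s * r[i] * sum(W[i][j] * c[j] * x[j] for j in range(len(x)))
            for i in range(len(W))]

W = [[1.0, 2.0],
     [3.0, 4.0]]
x = [1.0, 1.0]

# With all multipliers at 1 this reduces to a plain matrix-vector product.
y = forward(W, s=1.0, r=[1.0, 1.0], c=[1.0, 1.0], x=x)  # [3.0, 7.0]

# Forward-pass symmetry: scaling every r_i by a and s by 1/a gives the
# same output, so (s, r) is only identified up to such rescalings.
a = 4.0
y_sym = forward(W, s=1.0 / a, r=[a, a], c=[1.0, 1.0], x=x)
assert y == y_sym
```

In training, s, r, and c would be optimized alongside W (the paper uses Adam and Muon); the symmetry above is one of the practical questions the authors note, since it means several multiplier settings realize the same forward pass.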
Problem

Research questions and friction points this paper is trying to address.

weight decay
matrix scaling
language model training
learnable multipliers
norm equilibrium
Innovation

Methods, ideas, or system contributions that make the work stand out.

learnable multipliers
weight decay
muP
scale adaptation
language model optimization