u-μP: The Unit-Scaled Maximal Update Parametrization

📅 2024-07-24
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
To address the instability of large language models (LLMs) under low-precision training (e.g., FP8) and the fact that tuned hyperparameters depend on model size and therefore do not transfer across scales, this paper proposes u-μP, a unified scheme that combines the Maximal Update Parametrization (μP) with Unit Scaling. u-μP aligns the initial scales of activations, weights, and gradients to one, yielding scale-invariant default hyperparameters. This removes the need for precision-specific hyperparameter search: FP8 training works out of the box, and hyperparameters tuned on small proxy models transfer directly to large-scale models without modification. Experiments show that u-μP converges stably in FP8, reaches validation loss on par with or lower than standard μP, and makes hyperparameter sweeps substantially more efficient, balancing training stability, generalization, and engineering practicality.
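
The unit-scaling half of the idea is simple enough to sketch. Below is a minimal, hypothetical PyTorch illustration (not the authors' library, which also rescales weight gradients and covers many more ops): weights are drawn from N(0, 1), the usual 1/sqrt(fan_in) factor is moved out of the initialisation into a static forward scale, and a separate static backward scale keeps gradients near unit RMS as well.

```python
import torch

class _ScaledGrad(torch.autograd.Function):
    """Multiply by one static scale in the forward pass and a different
    scale in the backward pass: the core trick behind Unit Scaling."""

    @staticmethod
    def forward(ctx, x, fwd_scale, bwd_scale):
        ctx.bwd_scale = bwd_scale
        return x * fwd_scale

    @staticmethod
    def backward(ctx, grad_out):
        # Non-tensor inputs (the two scales) receive no gradient.
        return grad_out * ctx.bwd_scale, None, None

def unit_scaled_linear(x, weight):
    """Unit-scaled linear layer (no bias), for illustration only.

    `weight` is assumed to be initialised ~ N(0, 1) rather than the usual
    N(0, 1/fan_in); the 1/sqrt(fan_in) factor becomes a static forward
    scale, and a separate 1/sqrt(fan_out) backward scale keeps the input
    gradient near unit RMS too. (The full scheme also rescales the weight
    gradient; omitted here for brevity.)
    """
    fan_out, fan_in = weight.shape
    scaled_x = _ScaledGrad.apply(x, fan_in ** -0.5, fan_out ** -0.5)
    return scaled_x @ weight.t()

# Quick check: output RMS starts near 1, independent of width.
x = torch.randn(128, 1024)
w = torch.randn(4096, 1024)   # unit-variance init, not 1/sqrt(fan_in)
print(unit_scaled_linear(x, w).std())   # ~1.0
```

Because every tensor starts near unit scale, values sit comfortably in the narrow dynamic range of FP8, which is why the paper can treat FP8 training as plug-and-play.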

📝 Abstract
The Maximal Update Parametrization (μP) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-μP, which improves upon μP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low-precision. The two techniques have a natural affinity: μP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-μP models reaching a loss that is equal to or lower than comparable μP models and working out-of-the-box in FP8.
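
For intuition on the μP half, here is a rough sketch of the width-scaling rule as it is commonly stated for Adam (an illustrative assumption, not the paper's exact table; u-μP further adjusts the rules so the defaults land near one). The point is that a learning rate found on a narrow proxy transfers to a wide target because matrix learning rates shrink with width:

```python
def mup_adam_lr(base_lr: float, base_width: int, width: int, param_type: str) -> float:
    """Illustrative muP learning-rate rule for Adam (sketch only).

    Matrix-like weights whose fan-in grows with width scale their lr by
    base_width / width, while vector-like parameters (biases, norm gains)
    and input embeddings keep the base lr. Tuning base_lr on a small proxy
    (width == base_width) then transfers to the full-size model.
    """
    if param_type == "matrix_like":
        return base_lr * base_width / width
    return base_lr

# A learning rate swept on a width-256 proxy transfers to width 4096:
assert mup_adam_lr(1e-2, 256, 4096, "matrix_like") == 1e-2 * 256 / 4096
assert mup_adam_lr(1e-2, 256, 4096, "bias") == 1e-2
```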
Problem

Research questions and friction points this paper is trying to address.

Low-Precision Computing
Large-Scale Model
Efficient Testing and Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

u-μP
model optimization
FP8 training
👥 Authors
Charlie Blake (Graphcore)
C. Eichenberg (Aleph Alpha)
Josef Dean (Graphcore)
Lukas Balles (Aleph Alpha)
Luke Y. Prince (Graphcore)
Bjorn Deiseroth (Aleph Alpha)
Andres Felipe Cruz Salinas (Cohere)
Carlo Luschi (Graphcore, VP & Head of Research)
Samuel Weinbach (Aleph Alpha)
Douglas Orr (Graphcore)