🤖 AI Summary
This work identifies a failure of Maximal Update Parametrization ($\mu$P) theory in large language model (LLM) pretraining when vocabulary size vastly exceeds embedding dimension (width): the optimal learning-rate ratio between the embedding and hidden layers deviates from the $\mu$P-predicted $\Theta(\text{width})$ scaling. The authors analyze the effect of vocabulary size on training dynamics and show that, as vocabulary size grows, training interpolates between the $\mu$P regime and a newly identified "Large Vocab (LV)" regime, in which the optimal embedding-to-hidden LR ratio instead scales as $\Theta(\sqrt{\text{width}})$. Experiments validate the theory, and a 1B-parameter model pretrained from scratch demonstrates that the proposed embedding-LR scaling rule improves convergence and final performance, bridging the gap between classical scaling theories and practical large-vocabulary LLM training.
📝 Abstract
Pretraining large language models is a costly process. To make this process more efficient, several methods have been proposed to optimize model architecture/parametrization and hardware use. On the parametrization side, $\mu$P (Maximal Update Parametrization) parametrizes model weights and learning rate (LR) in a way that makes hyperparameters (HPs) transferable with width (embedding dimension): HPs can be tuned for a small model and used for larger models without additional tuning. While $\mu$P showed impressive results in practice, recent empirical studies have reported conflicting observations when applied to LLMs. One limitation of the theory behind $\mu$P is the fact that input dimension (vocabulary size in LLMs) is considered fixed when taking the width to infinity. This is unrealistic since vocabulary size is generally much larger than width in practice. In this work, we provide a theoretical analysis of the effect of vocabulary size on training dynamics, and subsequently show that as vocabulary size increases, the training dynamics *interpolate between the $\mu$P regime and another regime that we call the Large Vocab (LV) regime*, where optimal scaling rules are different from those predicted by $\mu$P. Our analysis reveals that in the LV regime, the optimal embedding-LR to hidden-LR ratio should roughly scale as $\Theta(\sqrt{\text{width}})$, surprisingly close to the empirical findings previously reported in the literature, and different from the $\Theta(\text{width})$ ratio predicted by $\mu$P. We conduct several experiments to validate our theory, and pretrain a 1B model from scratch to show the benefit of our suggested scaling rule for the embedding LR.
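To make the scaling rules concrete, here is a minimal sketch of how the embedding LR could be derived from the hidden-layer LR under the two regimes discussed above. The function name, the `rule` parameter, and the unit constant factors are illustrative assumptions, not from the paper; the paper only specifies the asymptotic $\Theta(\cdot)$ behavior of the ratio.

```python
import math

def embedding_lr(hidden_lr: float, width: int, rule: str = "lv") -> float:
    """Scale the embedding-layer LR relative to the hidden-layer LR.

    Under muP, the embedding-to-hidden LR ratio grows like Theta(width);
    under the Large Vocab (LV) rule it grows like Theta(sqrt(width)).
    Constant factors here are set to 1 for illustration only.
    """
    if rule == "mup":
        return hidden_lr * width              # muP: ratio ~ Theta(width)
    if rule == "lv":
        return hidden_lr * math.sqrt(width)   # LV: ratio ~ Theta(sqrt(width))
    raise ValueError(f"unknown rule: {rule}")

# Example: with hidden LR 1e-3 at width 4096, the LV rule gives an
# embedding LR of 1e-3 * sqrt(4096) = 0.064, versus 4.096 under muP.
```

In a typical training setup, the two values would then go into separate optimizer parameter groups (one for the embedding table, one for the hidden layers).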