🤖 AI Summary
This work addresses the instability in large language model pretraining caused by gradient explosion, which leads to substantial computational waste. The study identifies, for the first time, that decaying stable rank and increasing alignment of Jacobian matrices between adjacent layers are the key mechanisms driving gradient explosion. Building on this insight, the authors propose MSign, a lightweight optimizer that periodically applies the matrix sign function to restore the stable rank of weight matrices, thereby suppressing exponential gradient growth. Experiments demonstrate that MSign reliably prevents training collapse across models ranging from 5M to 3B parameters, at a computational overhead of less than 7.0%, with no substantial increase in overall training cost.
📝 Abstract
Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via $\mu$P, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.
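The two quantities named in the abstract are easy to compute directly. Below is a minimal numpy sketch (not the authors' implementation) of the stable rank as defined above, and of the matrix sign operation under the common assumption that, for a rectangular weight matrix, it is computed via the SVD as $UV^\top$, i.e. the matrix with all singular values set to 1. Under that assumption, applying the sign operation restores the stable rank to its maximum value, $\min(m, n)$.

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    fro_sq = np.sum(W ** 2)                 # ||W||_F^2
    spec = np.linalg.norm(W, ord=2)         # largest singular value
    return float(fro_sq / spec ** 2)

def matrix_sign(W: np.ndarray) -> np.ndarray:
    """Assumed SVD-based matrix sign for rectangular W: U V^T.

    This replaces every singular value with 1, so the result is
    semi-orthogonal and has maximal stable rank min(m, n).
    """
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt
```

For any full-rank $m \times n$ matrix, `stable_rank(matrix_sign(W))` equals $\min(m, n)$, illustrating why the operation counteracts the stable-rank decay observed before collapse.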