Characterization and Mitigation of Training Instabilities in Microscaling Formats

📅 2025-06-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical issue of sharp loss fluctuations during low-precision training of large language models (LLMs) in formats such as Microscaling (MX). We demonstrate that quantization of layer normalization parameters and key activations introduces multiplicative gradient bias, leading to training divergence. To address this, we propose a dynamic precision adjustment strategy integrating block-wise scaling quantization, low-precision GEMM, and in-situ intervention. Through systematic ablation studies across diverse precision configurations, we validate the universality of this phenomenon across nearly one thousand LLMs trained from scratch. Our mixed-precision configuration achieves convergence stability and final model performance on par with full-precision baselines, while significantly reducing computational overhead. The approach provides an interpretable, reproducible, and robust stability guarantee for low-precision LLM training.

📝 Abstract
Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA's Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch -- spanning compute budgets from $2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad range of weight-activation precision combinations -- we consistently observe that training in MX formats exhibits sharp, stochastic instabilities in the loss, particularly at larger compute scales. To explain this phenomenon, we conduct controlled experiments and ablations on a smaller proxy model that exhibits behavior similar to the language models, sweeping across architectural settings, hyperparameters, and precision formats. These experiments motivate a simple model in which multiplicative gradient bias introduced by the quantization of layer-norm affine parameters and a small fraction of activations can trigger runaway divergence. Through in situ intervention experiments on our proxy model, we demonstrate that instabilities can be averted or delayed by modifying precision schemes mid-training. Guided by these findings, we evaluate stabilization strategies in the LLM setting and show that certain hybrid configurations recover performance competitive with full-precision training. We release our code at https://github.com/Hither1/systems-scaling.
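The block-scaled quantization idea behind MX formats can be sketched as follows. This is an illustrative simplification, not the actual MX codec: real MX formats use FP8/FP6/FP4 element types with a shared power-of-two block scale, whereas this sketch rounds to a symmetric integer grid; the block size of 32 matches the MX specification, but the function and its details are hypothetical.

```python
import numpy as np

def mx_block_quantize(x, block_size=32, elem_bits=8):
    """Illustrative MX-style quantize-dequantize round trip (hypothetical
    simplification): each contiguous block of `block_size` elements shares
    one power-of-two scale, and elements are rounded to a small signed
    integer grid of `elem_bits` bits."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    qmax = 2 ** (elem_bits - 1) - 1              # e.g. 127 for 8-bit elements
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Shared power-of-two scale per block, chosen so amax fits on the grid
    # (the floor on amax avoids log2(0) for all-zero blocks).
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / qmax))
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(-1)[: len(x)]
```

Because the scale is shared, a single outlier in a block coarsens the grid for all 32 of its neighbors, which is one intuition for why a small fraction of extreme activations can matter disproportionately.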
Problem

Research questions and friction points this paper is trying to address.

Investigates training instabilities in Microscaling (MX) formats
Explores causes of loss divergence in low-precision training
Proposes stabilization strategies for efficient LLM training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses block-scaled MX formats for efficiency
Identifies gradient bias as instability cause
Proposes hybrid precision for stabilization
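The hybrid-precision idea can be sketched in a toy form. Everything below is an assumption-laden illustration (the exact configurations are in the paper's code release): the GEMM operands are fake-quantized while the layer-norm affine parameters, whose quantization the paper identifies as a trigger of divergence, stay in full precision.

```python
import numpy as np

def fake_quant(t, bits=8):
    """Simple symmetric fake-quantization stand-in (not the MX codec)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(t).max(), 1e-30) / qmax
    return np.round(t / scale) * scale

def hybrid_block(x, w, gamma, beta, eps=1e-5):
    """Hypothetical hybrid-precision sub-block: layer norm and its affine
    parameters (gamma, beta) run in full precision; only the GEMM operands
    are quantized before the matmul."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    h = (x - mu) / np.sqrt(var + eps)
    h = gamma * h + beta                      # full-precision affine transform
    return fake_quant(h) @ fake_quant(w)      # low-precision GEMM path
```

Quantizing `gamma` and `beta` as well would introduce the multiplicative gradient bias the paper describes; keeping them out of the low-precision path is the kind of targeted intervention the stabilization strategies rely on.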
👥 Authors
Huangyuan Su — Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
Mujin Kwun — Harvard University
Stephanie Gil — Assistant Professor, Harvard University (networked robotics, multi-robot control)
Sham Kakade — Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
Nikhil Anand — Sr. Research Scientist, Kempner Institute @ Harvard (machine learning, theoretical physics)