To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability

📅 2024-05-29
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
This work investigates whether FP8 low-precision training can safely replace BF16 in large language model (LLM) pretraining, weighing cost savings against training stability. The authors propose a new metric for quantifying loss-landscape sharpness in autoregressive language models and evaluate robustness across random seeds and learning rates in three-way comparisons of BF16, FP16, and FP8. By simulating incremental bit reductions in floating-point representations, they connect reduced representational power to training instability, finding that currently available FP8 training methods noticeably degrade stability, particularly sensitivity to random seeds and learning-rate selection. The study characterizes the resulting accuracy–stability trade-off and offers practical guidance for safe, cost-effective low-precision LLM training.

📝 Abstract
The massive computational costs associated with large language model (LLM) pretraining have spurred great interest in reduced-precision floating-point representations to accelerate the process. As a result, the BrainFloat16 (BF16) precision has become the de facto standard for LLM training, with hardware support included in recent accelerators. This trend has gone even further in the latest processors, where FP8 has recently been introduced. However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8, with even fewer bits than FP16, can be a cost-effective option for LLM training. We argue that reduced-precision training schemes must have similar training stability and hyperparameter sensitivities to their higher-precision counterparts in order to be cost-effective. However, we find that currently available methods for FP8 training are not robust enough to allow their use as economical replacements. This prompts us to investigate the stability of reduced-precision LLM training in terms of robustness across random seeds and learning rates. To this end, we propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models. By simulating incremental bit reductions in floating-point representations, we analyze the relationship between representational power and training stability with the intent of aiding future research into the field.
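The abstract's "incremental bit reductions" can be illustrated with a small sketch. The snippet below is not the paper's simulator; it is a minimal, assumed approach that rounds each value to a chosen number of explicit mantissa bits (sign and exponent untouched), which is the standard way to emulate how a narrower significand, as in FP8's E5M2 or E4M3 layouts, loses representational power.

```python
import numpy as np

def quantize_mantissa(x, mantissa_bits):
    """Simulate a reduced-precision float format by rounding each value
    to `mantissa_bits` explicit mantissa bits (plus the implicit leading
    bit). Sign and exponent are kept at full range, so only the loss of
    significand precision is modeled."""
    x = np.asarray(x, dtype=np.float64)
    m, e = np.frexp(x)                    # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)    # +1 accounts for the implicit bit
    m_q = np.round(m * scale) / scale     # round mantissa to the coarser grid
    return np.ldexp(m_q, e)               # reassemble m_q * 2**e
```

Sweeping `mantissa_bits` from 23 (float32-like) down to 2 (E5M2-like) and rerunning training at each setting is one way to trace how stability degrades as representational power shrinks.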
Problem

Research questions and friction points this paper is trying to address.

Investigates FP8 stability for cost-effective LLM training
Compares FP8 robustness to BF16 in hyperparameter sensitivity
Proposes metrics to quantify loss landscape sharpness in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates FP8 training stability for LLMs
Proposes new loss landscape sharpness metric
Simulates bit reduction effects on training
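The paper's sharpness metric is tailored to autoregressive language models and is not reproduced here; as a rough illustration of the general idea, a generic sharpness proxy measures the worst loss increase over small random perturbations of the weights. All names below are hypothetical.

```python
import numpy as np

def sharpness_proxy(loss_fn, w, rho=0.05, n_samples=20, seed=0):
    """Generic sharpness proxy (illustration only, not the paper's metric):
    the largest loss increase observed over random perturbations of
    norm `rho` around the weight vector `w`."""
    rng = np.random.default_rng(seed)
    base = loss_fn(w)
    worst = 0.0
    for _ in range(n_samples):
        d = rng.standard_normal(w.shape)
        d *= rho / np.linalg.norm(d)      # project onto the rho-sphere
        worst = max(worst, loss_fn(w + d) - base)
    return worst
```

On a toy quadratic loss, a steeper curvature yields a larger proxy value, which is the qualitative behavior a sharpness metric should capture.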