Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work identifies and disentangles two orthogonal, previously overlooked failure modes in low-bit floating-point quantization-aware training (HiF8 W8A8): silent corruption of forward-pass representations caused by amax saturation, and catastrophic forgetting induced by an aggressive learning rate. Neither is detectable from training loss alone. To suppress saturation, the authors use a conservative max-based amax scaling strategy over a 64-step history window; to prevent forgetting, they run a 500-step BF16 warm-up followed by QAT at a low learning rate (10^{-5}). Evaluated on OpenPangu-Embedded-1B against a matched BF16 baseline, the final configuration achieves near-lossless HiF8 quantization: drops of only 0.43% on MMLU, 0.58% on HellaSwag, and 0.22% on ARC-Challenge, with a training loss APE of 0.11% over 10,000 training steps.

📝 Abstract

Quantization-aware training (QAT) with low-bit floating-point formats enables efficient LLM deployment, yet introduces subtle failure modes invisible to standard training metrics. We present a systematic study of HiF8 W8A8 QAT for OpenPangu-Embedded-1B through the lens of Delayed Tensor Scaling (DTS). Across eight controlled experiments, we identify and disentangle two orthogonal failure modes: (i) amax saturation, where delayed scale estimates silently corrupt knowledge-sensitive representations via forward-pass clipping, and (ii) catastrophic forgetting, where an aggressive learning rate overwrites pretrained commonsense knowledge independently of quantization. Neither is detectable from training loss alone. We address amax saturation with a conservative max-algorithm DTS strategy over a 64-step history window, and mitigate forgetting via a 500-step BF16 warmup followed by QAT at lr=10^{-5}. Both fixes are necessary and sufficient: our final configuration achieves 0.43% MMLU drop, 0.58% HellaSwag drop, and 0.22% ARC-Challenge drop versus a matched BF16 baseline, with a training loss APE of only 0.11% over 10,000 steps.
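
To make the amax saturation mechanism concrete, here is a minimal sketch of delayed tensor scaling with a max-over-window estimate. It is an illustration under stated assumptions, not the authors' implementation: the class `DelayedScaler`, the helper `run`, the `"most_recent"` comparison policy, and the `FMT_MAX` value are all hypothetical, and only the 64-step window and the max algorithm come from the abstract. The sketch models clipping only and does not simulate HiF8 rounding.

```python
from collections import deque

import numpy as np

# Placeholder format maximum (this is the FP8 E4M3 value); substitute the
# HiF8 maximum when adapting this sketch.
FMT_MAX = 448.0


class DelayedScaler:
    """Delayed scaling: the scale for step t is derived from earlier steps only."""

    def __init__(self, fmt_max, history_len=64, algo="max"):
        if algo not in ("max", "most_recent"):
            raise ValueError(f"unknown algo: {algo}")
        self.fmt_max = float(fmt_max)
        self.algo = algo
        self.history = deque(maxlen=history_len)  # amax of past tensors

    def amax_estimate(self):
        """Estimated amax from the history window, or None if it is empty."""
        if not self.history:
            return None
        return max(self.history) if self.algo == "max" else self.history[-1]

    def step(self, x):
        """Return (scale, clipped_fraction) for x, then record x's amax."""
        cur = float(np.max(np.abs(x)))
        est = self.amax_estimate()
        if est is None or est == 0.0:
            # No usable history yet: fall back to the current tensor's amax.
            est = cur if cur > 0 else 1.0
        scale = self.fmt_max / est
        # Elements larger than the estimate map beyond fmt_max and saturate.
        clipped = float(np.mean(np.abs(x) > est))
        self.history.append(cur)
        return scale, clipped


def run(algo, amaxes=(1.0, 1.0, 8.0, 1.0, 8.0)):
    """Feed a toy stream with recurring spikes; return clipped fraction per step."""
    scaler = DelayedScaler(FMT_MAX, history_len=64, algo=algo)
    return [scaler.step(np.array([a, a / 2]))[1] for a in amaxes]


print("max        :", run("max"))
print("most_recent:", run("most_recent"))
```

On this toy stream both policies clip at the first spike, because no delayed estimate can anticipate it. The max-over-window policy remembers the spike and does not clip when it recurs, whereas the most-recent policy clips again. The cost of the conservative choice is a smaller scale, and hence coarser resolution for typical values, for as long as the spike stays in the window.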

Problem

Research questions and friction points this paper is trying to address.

quantization-aware training

amax saturation

catastrophic forgetting

low-bit floating-point

LLM deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Delayed Tensor Scaling

HiF8 W8A8 Quantization

amax saturation