Oscillation-Reduced MXFP4 Training for Vision Transformers

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe accuracy degradation of Vision Transformers (ViTs) under MXFP4 ultra-low-precision training, this work identifies quantization-induced weight oscillation in the forward pass as the primary cause, a finding not previously reported. The authors propose TetraJet, a training framework with two core components: (1) an EMA Quantizer (Q-EMA), which mitigates weight oscillation by basing quantization decisions on an exponential moving average of the weights; and (2) an Adaptive Ramping Optimizer (Q-Ramping), which adaptively adjusts the updates of oscillation-prone weights to stabilize training. Built on the MXFP4 microscaling format with 4-bit forward and backward passes, TetraJet cuts accuracy loss by more than 50% on ViT models relative to prior 4-bit methods and achieves performance competitive with full-precision training, providing a systematic treatment of ultra-low-precision training for vision foundation models.
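The weight oscillation the summary refers to can be shown with a minimal toy dynamic (illustrative only, not the paper's analysis): a latent master weight sitting just below a rounding boundary keeps flipping between the two adjacent quantized levels, because each update's sign depends on which side of the latent weight the quantized value landed.

```python
# Toy illustration of quantization-induced weight oscillation.
# A latent (master) weight near the rounding boundary at 0.25 flips
# between the quantized levels 0.0 and 0.5 on every step, even though
# each underlying update is tiny. Constants are made up for the demo.
step = 0.5            # spacing between adjacent quantized levels
w = 0.249             # latent weight, just below the boundary 0.25
history = []
for t in range(6):
    q = step * round(w / step)          # round-to-nearest quantization
    # Toy gradient whose sign depends on which side of w the
    # quantized value landed (idealized feedback, not the paper's model).
    grad = 0.02 if q > w else -0.02
    w -= 0.1 * grad                     # small SGD-like update
    history.append(q)
# history alternates: [0.0, 0.5, 0.0, 0.5, 0.0, 0.5]
```

The quantized weight never settles even though the latent weight barely moves; damping exactly this flip-flop is what Q-EMA and Q-Ramping target.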

📝 Abstract
Pre-training Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. The Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with the MXFP4 data format still results in significant degradation, and there is a lack of systematic research on the reason. In this work, we propose a novel training method, TetraJet, for more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training, and identify the weight oscillation problem in the forward pass as the main source of the degradation in MXFP4 training. Therefore, we introduce two novel methods, EMA Quantizer (Q-EMA) and Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms the existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than 50% compared to the baseline, and can even achieve competitive performance compared to full-precision training. The code is available at https://github.com/thu-ml/TetraJet-MXFP4Training.
Problem

Research questions and friction points this paper is trying to address.

Reduces accuracy loss in FP4 precision training
Addresses weight oscillation in MXFP4 training
Improves Vision Transformers' performance with 4-bit precision
Innovation

Methods, ideas, or system contributions that make the work stand out.

TetraJet method enhances FP4 training accuracy
EMA Quantizer reduces weight oscillation effectively
Adaptive Ramping Optimizer improves training stability
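The EMA Quantizer idea in the list above can be sketched like this: instead of rounding the instantaneous weight to its nearest grid point, round toward whichever neighbor is closer to an exponential moving average of past weights, which damps flip-flopping near a rounding boundary. The function name and details below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def q_ema_round(w, w_ema, step):
    """Sketch of EMA-guided rounding: pick between the two grid neighbors
    of the current weight `w` using the smoothed history `w_ema`
    (maintained elsewhere as w_ema = beta * w_ema + (1 - beta) * w).
    `step` is the grid spacing. Illustrative only."""
    lo = step * np.floor(w / step)      # grid point just below w
    hi = lo + step                      # grid point just above w
    # Round toward the neighbor closer to the EMA of past weights,
    # so a momentary boundary crossing does not flip the quantized value.
    return np.where(np.abs(w_ema - lo) <= np.abs(w_ema - hi), lo, hi)
```

With the toy numbers from the oscillation example, a weight at 0.249 whose history has mostly sat above the 0.25 boundary (EMA 0.4) stays pinned at 0.5 instead of alternating with 0.0.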