🤖 AI Summary
Native FP8 training of Transformers frequently fails due to extreme activation outliers, prompting existing solutions to rely on mixed-precision schemes or architectural modifications. Method: Challenging the conventional assumption that outliers stem from data distribution, this work identifies their root cause as structural properties—specifically, collinearity in weight matrices—that induce deterministic outlier generation. We propose TWEO, a non-intrusive, structure-aware loss term that requires no model architecture changes or complex precision scheduling. Results: TWEO enables end-to-end, fully static per-tensor W8A8 quantization across the entire model. It is compatible with both LLMs and ViTs, reduces activation outliers from thousands to under 20 on native FP8 hardware, improves training throughput by 36%, and matches BF16 baseline accuracy. Crucially, it achieves the first truly hardware-friendly, state-of-the-art W8A8 quantization performance on cutting-edge FP8 accelerators.
📝 Abstract
Native FP8 support in modern hardware is essential for training large Transformers, but it is severely hindered by extreme activation outliers. Existing solutions rely either on complex mixed-precision engineering or on invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically produced artifact of training, originating from specific structural properties of the weight matrices (i.e., collinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO prevents extreme outliers with a very simple loss term, reducing outlier magnitudes from 10,000+ to under 20. TWEO thus enables full-model FP8 pre-training, with neither engineering tricks nor architectural changes, for both LLMs and ViTs. Where standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. TWEO also enables a new quantization paradigm: hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.
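The abstract attributes extreme outliers to collinearity among weight-matrix rows and describes TWEO as a simple auxiliary loss that discourages this structure. The exact form of the TWEO loss is not given here, so the following is only a minimal sketch of what such a structure-aware penalty could look like: it normalizes the rows of a weight matrix and penalizes large off-diagonal entries of the resulting Gram matrix (i.e., near-collinear row pairs). The function name `collinearity_penalty` and its formulation are assumptions for illustration, not the paper's actual loss.

```python
import numpy as np

def collinearity_penalty(weight, eps=1e-8):
    """Hypothetical structure-aware penalty (NOT the paper's TWEO loss):
    discourages near-collinear rows by penalizing off-diagonal cosine
    similarities between unit-normalized weight rows."""
    # Normalize each row to unit length (eps guards against zero rows).
    w = weight / (np.linalg.norm(weight, axis=1, keepdims=True) + eps)
    # Gram matrix of normalized rows: entry (i, j) is cos similarity of rows i, j.
    gram = w @ w.T
    # Zero the diagonal so only cross-row collinearity is penalized.
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.mean(off_diag ** 2))

# Orthogonal rows incur no penalty; duplicated (collinear) rows do.
print(collinearity_penalty(np.eye(3)))                            # 0.0
print(collinearity_penalty(np.array([[1.0, 0.0], [1.0, 0.0]])))   # 0.5
```

In a training loop, a term like this would be scaled by a small coefficient and added to the task loss, leaving the model architecture and precision schedule untouched, which matches the non-invasive spirit the abstract describes.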