SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficient low-bit attention for both inference and training, proposing the first attention framework to support FP4 inference and 8-bit training. Methodologically, it leverages the FP4 Tensor Cores of Blackwell-architecture GPUs for high-throughput inference, applies microscaling (fine-grained per-block scaled) FP4 quantization to the attention matrices, and designs an 8-bit fixed-point quantization strategy for training, all while maintaining compatibility with the FlashAttention interface. This enables, for the first time, low-bit attention across both the forward and backward passes. Experiments show that the FP4 attention kernel reaches 1038 TOPS on an RTX 5090, a 5× speedup over the fastest FlashAttention on that GPU, and accelerates inference of various models in a plug-and-play way; 8-bit attention incurs no accuracy loss in fine-tuning, while 8-bit pretraining converges more slowly but remains viable. Together, these results establish an end-to-end, high-performance paradigm for low-bit attention in large models.
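The microscaling FP4 scheme mentioned above shares one scale factor per small contiguous block of values rather than per tensor. A minimal NumPy sketch, assuming 1×16 blocks and the FP4 E2M1 value set (the block size, value set, and function name are illustrative assumptions, not the paper's kernel):

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (a sign bit adds +/-).
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mx_fp4_quantize(x, block=16):
    """Microscaling FP4 sketch: one shared scale per block of 16 values.

    Illustrative only; the paper's CUDA kernel applies this kind of
    per-block quantization to Q, K, and the softmax matrix so the
    matmuls can run on Blackwell FP4 Tensor Cores.
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block
    xb = np.pad(x, (0, pad)).reshape(-1, block)
    # Per-block scale maps each block's max magnitude onto E2M1's max (6.0).
    scale = np.abs(xb).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0
    # Round each scaled magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(xb / scale)[..., None] - FP4_E2M1).argmin(axis=-1)
    q = np.sign(xb) * FP4_E2M1[idx]
    # Return the dequantized approximation to show the quantization error.
    return (q * scale).reshape(-1)[: len(x)]
```

Because the scale is per 16-value block rather than per tensor, one outlier only degrades the precision of its own block, which is why microscaling keeps FP4 attention accurate.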

📝 Abstract
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX 5090, which is a 5× speedup over the fastest FlashAttention on RTX 5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention for training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.
Problem

Research questions and friction points this paper is trying to address.

Enhancing attention efficiency via FP4 Tensor Cores for faster inference
Exploring 8-bit attention for training tasks to improve efficiency
Achieving lossless performance in fine-tuning with low-bit attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

FP4 Tensor Cores accelerate attention computation
8-bit attention for forward and backward propagation
Plug-and-play FP4 attention for model inference
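The 8-bit forward/backward path listed above quantizes the attention matmul operands to INT8 and accumulates in higher precision. A minimal sketch of symmetric INT8 quantized matrix multiplication (per-tensor scaling is an assumption for brevity; the paper designs a fixed-point scheme tailored to attention):

```python
import numpy as np

def int8_quantize(x):
    """Symmetric per-tensor INT8 quantization (illustrative sketch only)."""
    scale = np.abs(x).max() / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    """Quantize both operands, multiply in int32, then dequantize.

    In the 8-bit attention kernel this pattern would cover Q @ K^T in the
    forward pass and the corresponding matmuls in the backward pass.
    """
    qa, sa = int8_quantize(a)
    qb, sb = int8_quantize(b)
    return qa.astype(np.int32) @ qb.astype(np.int32) * (sa * sb)
```

Fine-tuning tolerates this quantization error well (the "lossless" result above), while pretraining, which must shape representations from scratch, is more sensitive and converges more slowly.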