SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficient low-bit attention for both inference and training, proposing the first attention framework to support FP4 inference and 8-bit training. Methodologically, it leverages the FP4 Tensor Cores of Blackwell-architecture GPUs for high-throughput inference, applies microscaling (fine-grained per-block scaled) FP4 quantization to the attention matrices, and designs an 8-bit fixed-point quantization strategy for training, all while maintaining compatibility with the FlashAttention interface. This enables, for the first time, low-bit attention across both the forward and backward passes. Experiments show that the FP4 attention kernel reaches 1038 TOPS on an RTX 5090, a 5× speedup over the fastest FlashAttention on that GPU, and accelerates inference of various models in a plug-and-play way; 8-bit attention incurs no accuracy loss in fine-tuning, while 8-bit pretraining converges more slowly but remains viable. Together, these results establish an end-to-end, high-performance paradigm for low-bit attention in large models.
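The microscaling FP4 scheme mentioned above shares one scale factor per small contiguous block of values rather than per tensor. A minimal NumPy sketch, assuming 1×16 blocks and the FP4 E2M1 value set (the block size, value set, and function name are illustrative assumptions, not the paper's kernel):

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (a sign bit adds +/-).
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mx_fp4_quantize(x, block=16):
    """Microscaling FP4 sketch: one shared scale per block of 16 values.

    Illustrative only; the paper's CUDA kernel applies this kind of
    per-block quantization to Q, K, and the softmax matrix so the
    matmuls can run on Blackwell FP4 Tensor Cores.
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block
    xb = np.pad(x, (0, pad)).reshape(-1, block)
    # Per-block scale maps each block's max magnitude onto E2M1's max (6.0).
    scale = np.abs(xb).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0
    # Round each scaled magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(xb / scale)[..., None] - FP4_E2M1).argmin(axis=-1)
    q = np.sign(xb) * FP4_E2M1[idx]
    # Return the dequantized approximation to show the quantization error.
    return (q * scale).reshape(-1)[: len(x)]
```

Because the scale is per 16-value block rather than per tensor, one outlier only degrades the precision of its own block, which is why microscaling keeps FP4 attention accurate.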

📝 Abstract
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX 5090, which is a 5× speedup over the fastest FlashAttention on RTX 5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention for training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.
Problem

Research questions and friction points this paper is trying to address.

Enhancing attention efficiency via FP4 Tensor Cores for faster inference
Exploring 8-bit attention for training tasks to improve efficiency
Achieving lossless performance in fine-tuning with low-bit attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

FP4 Tensor Cores accelerate attention computation
8-bit attention for forward and backward propagation
Plug-and-play FP4 attention for model inference
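The 8-bit forward/backward path listed above quantizes the attention matmul operands to INT8 and accumulates in higher precision. A minimal sketch of symmetric INT8 quantized matrix multiplication (per-tensor scaling is an assumption for brevity; the paper designs a fixed-point scheme tailored to attention):

```python
import numpy as np

def int8_quantize(x):
    """Symmetric per-tensor INT8 quantization (illustrative sketch only)."""
    scale = np.abs(x).max() / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a, b):
    """Quantize both operands, multiply in int32, then dequantize.

    In the 8-bit attention kernel this pattern would cover Q @ K^T in the
    forward pass and the corresponding matmuls in the backward pass.
    """
    qa, sa = int8_quantize(a)
    qb, sb = int8_quantize(b)
    return qa.astype(np.int32) @ qb.astype(np.int32) * (sa * sb)
```

Fine-tuning tolerates this quantization error well (the "lossless" result above), while pretraining, which must shape representations from scratch, is more sensitive and converges more slowly.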