SageBwd: A Trainable Low-bit Attention

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability and performance degradation commonly observed in low-bit attention mechanisms during training, where quantization errors often prevent them from matching their full-precision counterparts. We present a systematic optimization of trainable INT8 low-bit attention by quantizing six of the seven matrix multiplications involved in attention computation and integrating QK-norm with K-smoothing to enhance training stability. Through analysis, we identify the gradient of the attention scores (dS) in backpropagation as the primary source of quantization error and propose reducing the number of tokens processed per step to mitigate it. Our method, SageBwd, achieves pre-training performance on par with full-precision attention, marking the first demonstration that INT8 low-bit attention is feasible in training scenarios.
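The summary mentions INT8 quantization of the attention matmuls combined with K-smoothing. A minimal NumPy sketch of the idea (an illustration, not the paper's kernel): K-smoothing subtracts the per-channel mean of K before quantization, which leaves the softmax output mathematically unchanged but shrinks K's dynamic range and hence its quantization error. All names and the toy offset are illustrative assumptions.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def int8_quantize(x):
    # Symmetric per-tensor INT8 quantization to [-127, 127].
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def quantized_scores(q, k, smooth_k=False):
    # K-smoothing: subtract the per-channel mean of K across tokens.
    # Q @ mean(K)^T adds the same constant to every score within a row,
    # which softmax cancels, so the result is mathematically unchanged
    # while K's dynamic range (and quantization error) shrinks.
    if smooth_k:
        k = k - k.mean(axis=0, keepdims=True)
    qq, sq = int8_quantize(q)
    kq, sk = int8_quantize(k)
    # Integer matmul, then dequantize with the product of the scales.
    return (qq.astype(np.int32) @ kq.astype(np.int32).T) * (sq * sk)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((16, 8)) + 10.0  # channel-wise outliers in K

ref = softmax(q @ k.T)  # full-precision attention probabilities
for smooth in (False, True):
    err = np.abs(softmax(quantized_scores(q, k, smooth)) - ref).max()
    print(f"K-smoothing={smooth}: max probability error {err:.4f}")
```

With the outlier offset in K, per-tensor quantization wastes most of the INT8 range on the mean; smoothing recenters K so the same 8 bits cover the informative variation.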

📝 Abstract
Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of the seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap relative to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and show how to close it, enabling SageBwd to match full-precision attention during pre-training. Through experiments and theoretical analysis, we reach several important conclusions: (i) QK-norm is necessary for stable training at large tokens per step; (ii) quantization errors primarily arise from the backward-pass score gradient dS; (iii) reducing tokens per step enables SageBwd to match FPA performance in pre-training; and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
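The abstract attributes most of the quantization error to the score gradient dS in the backward pass. A plain NumPy sketch of the standard attention backward step that produces dS (the textbook softmax Jacobian, not the paper's fused kernel; all names are illustrative):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_score_grad(q, k, v, d_out):
    # Forward pieces: S = Q K^T / sqrt(d), P = softmax(S), O = P V.
    d = q.shape[-1]
    p = softmax(q @ k.T / np.sqrt(d))
    # Backward: dP = dO V^T, then the softmax Jacobian gives
    # dS = P * (dP - rowsum(dP * P)).
    dp = d_out @ v.T
    return p * (dp - (dp * p).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(1)
n, d = 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
ds = attention_score_grad(q, k, v, rng.standard_normal((n, d)))
# Softmax-Jacobian property: every row of dS sums to zero, so dS mixes
# positive and negative entries of widely varying magnitude -- one
# intuition for why it is the hardest tensor to quantize to INT8.
print(np.abs(ds.sum(axis=-1)).max())
```

The zero row-sum follows directly from the formula: summing P * (dP - c) over a row gives rowsum(P * dP) - c * rowsum(P) = c - c = 0.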
Problem

Research questions and friction points this paper is trying to address.

low-bit attention
training
pre-training
quantization
performance gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-bit attention
trainable quantization
INT8 training
attention mechanism
pre-training stability