BAPS: A Fine-Grained Low-Precision Scheme for Softmax in Attention via Block-Aware Precision reScaling

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance bottleneck of softmax in Transformer inference, which stems from its high memory bandwidth demands and the substantial area overhead of high-precision exponential computation units. To overcome the accuracy limitations of conventional low-precision approaches without significant model degradation, the authors propose a co-designed algorithm-hardware solution featuring an 8-bit floating-point format (HiF8) and block-aware precision rescaling. The method employs HiF8 representation, a lightweight exponential unit, and a bandwidth-optimized data path to enable hardware-friendly computation. Experiments on language and multimodal models demonstrate that the proposed approach reduces data movement bandwidth by 50% and substantially shrinks the EXP2 unit area, enabling a potential doubling of end-to-end inference throughput without increasing chip area.
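The block-aware rescaling idea described above can be sketched in NumPy. Note this is a minimal illustration, not the paper's implementation: the HiF8 format details are not given here, so `quantize_8bit` is a generic 8-bit-float stand-in, and all function names and the block size are illustrative.

```python
import numpy as np

def quantize_8bit(x, mant_bits=3):
    """Crude emulation of an 8-bit floating-point value (a stand-in for
    HiF8, whose exact bit layout is not specified in this summary):
    keep only a few mantissa bits, preserve the exponent."""
    mant, exp = np.frexp(x)                       # x = mant * 2**exp, |mant| in [0.5, 1)
    step = 2.0 ** (mant_bits + 1)
    mant = np.round(mant * step) / step           # truncate mantissa precision
    return np.ldexp(mant, exp)

def block_rescaled_softmax(scores, block=32):
    """Softmax computed block by block: each block is shifted by its own
    maximum before the low-precision exponential, so the exp inputs stay in
    a narrow, quantization-friendly range; blocks are then merged with
    rescaling factors (the same trick as online/flash softmax)."""
    m, s = -np.inf, 0.0                           # running max and denominator
    exps, maxes = [], []
    for start in range(0, len(scores), block):
        blk = scores[start:start + block]
        bm = blk.max()
        e = np.exp(quantize_8bit(blk - bm))       # only this exp sees 8-bit inputs
        new_m = max(m, bm)
        s = s * np.exp(m - new_m) + e.sum() * np.exp(bm - new_m)
        m = new_m
        exps.append(e)
        maxes.append(bm)
    # Re-scale every block to the global max and normalize.
    return np.concatenate([e * np.exp(bm - m) for e, bm in zip(exps, maxes)]) / s
```

Only the per-element exponentials run at reduced precision; the per-block correction factors `exp(bm - m)` are a handful of scalars and can stay in full precision, which is what makes the accuracy loss small.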

📝 Abstract
As the performance gains from accelerating quantized matrix multiplication plateau, the softmax operation becomes the critical bottleneck in Transformer inference. This bottleneck stems from two hardware limitations: (1) limited data bandwidth between matrix and vector compute cores, and (2) the significant area cost of high-precision (FP32/FP16) exponentiation units (EXP2). To address these issues, we introduce a novel low-precision workflow that employs a specific 8-bit floating-point format (HiF8) and block-aware precision rescaling for softmax. Crucially, our algorithmic innovations make low-precision softmax feasible without the significant model accuracy loss that hampers direct low-precision approaches. Specifically, our design (i) halves the required data movement bandwidth by constraining matrix multiplication outputs to 8 bits, and (ii) substantially reduces the EXP2 unit area by computing exponentiations in low (8-bit) precision. Extensive evaluation on language models and multimodal models confirms the validity of our method. By alleviating the vector computation bottleneck, our work paves the way for doubling end-to-end inference throughput without increasing chip area, and offers a concrete co-design path for future low-precision hardware and software.
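The block-aware rescaling described in the abstract rests on the standard safe-softmax merge identity; the derivation below is a paraphrase of that well-known identity, not an equation taken from the paper. Partition the scores into blocks $B_k$ with per-block maxima $m_k = \max_{j \in B_k} x_j$ and global maximum $m = \max_k m_k$; then for $i \in B_k$,

```latex
\mathrm{softmax}(x)_i
  = \frac{e^{x_i - m_k}\, e^{m_k - m}}
         {\sum_{k'} e^{m_{k'} - m} \sum_{j \in B_{k'}} e^{x_j - m_{k'}}}.
```

The inner exponentials $e^{x_j - m_{k'}}$ take arguments confined to a narrow non-positive range, which is what allows them to be evaluated in 8-bit precision; the per-block factors $e^{m_{k'} - m}$ are only one scalar per block and remain cheap at higher precision.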
Problem

Research questions and friction points this paper is trying to address.

softmax
low-precision
Transformer inference
hardware bottleneck
data bandwidth
Innovation

Methods, ideas, or system contributions that make the work stand out.

low-precision softmax
block-aware rescaling
HiF8
Transformer inference
hardware-software co-design
👥 Authors
Zisheng Ye (Taylor Lab, Huawei)
Xiaoyu He (Taylor Lab, Huawei)
Maoyuan Song (Taylor Lab, Huawei)
Guoliang Qiu (Shanghai Jiao Tong University; Theoretical Computer Science)
Chao Liao (Taylor Lab, Huawei)
Chen Wu (Taylor Lab, Huawei)
Yonggang Sun (HiSilicon, Huawei)
Zhichun Li (HiSilicon, Huawei)
Xiaoru Xie (HiSilicon, Huawei)
Yuanyong Luo (HiSilicon, Huawei)
Hu Liu (HiSilicon, Huawei)
Pinyan Lu (ITCS, Shanghai University of Finance and Economics; Complexity, Algorithms, Game Theory)
Heng Liao (HiSilicon, Huawei)