🤖 AI Summary
This work addresses the performance bottleneck of softmax in Transformer inference, which stems from its high memory bandwidth demands and the substantial area overhead of high-precision exponential computation units. To overcome the accuracy degradation that hampers conventional low-precision approaches, the authors propose a co-designed algorithm-hardware solution featuring an 8-bit floating-point format (HiF8) and block-aware precision rescaling. The method combines the HiF8 representation, a lightweight exponential unit, and a bandwidth-optimized data path to enable hardware-friendly computation. Experiments on language and multimodal models demonstrate that the proposed approach reduces data movement bandwidth by 50% and substantially shrinks the EXP2 unit area, enabling a potential doubling of end-to-end inference throughput without increasing chip area.
📝 Abstract
As the performance gains from accelerating quantized matrix multiplication plateau, the softmax operation becomes the critical bottleneck in Transformer inference. This bottleneck stems from two hardware limitations: (1) limited data bandwidth between matrix and vector compute cores, and (2) the significant area cost of high-precision (FP32/FP16) exponentiation units (EXP2). To address these issues, we introduce a novel low-precision workflow that employs a specific 8-bit floating-point format (HiF8) and block-aware precision rescaling for softmax. Crucially, our algorithmic innovations make low-precision softmax feasible without the significant model accuracy loss that hampers direct low-precision approaches. Specifically, our design (i) halves the required data movement bandwidth by constraining matrix multiplication outputs to 8 bits, and (ii) substantially reduces the EXP2 unit area by computing exponentiations in low (8-bit) precision. Extensive evaluation on language models and multimodal models confirms the validity of our method. By alleviating the vector computation bottleneck, our work paves the way for doubling end-to-end inference throughput without increasing chip area, and offers a concrete co-design path for future low-precision hardware and software.
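To give an intuition for how a low-precision, block-aware softmax can stay numerically stable, the sketch below simulates the idea in NumPy. This is not the paper's implementation: the HiF8 encoding is not specified here, so an E4M3-style 8-bit float quantizer stands in for it, and "block-aware precision rescaling" is interpreted as online-softmax-style per-block max tracking with rescaling of earlier partial results; the function names are hypothetical.

```python
import numpy as np

def quantize_fp8_e4m3(x):
    # Crude stand-in for an 8-bit float (E4M3-style): clamp to the E4M3
    # dynamic range and keep roughly 4 significant mantissa bits.
    # HiF8's actual encoding may differ; this is illustrative only.
    x = np.clip(x, -448.0, 448.0)
    mant, exp = np.frexp(x)            # x = mant * 2**exp, |mant| in [0.5, 1)
    mant = np.round(mant * 16) / 16    # truncate mantissa precision
    return np.ldexp(mant, exp)

def block_softmax(scores, block=4):
    """Block-aware softmax sketch: track a running max per block and rescale
    previously computed partial results when a larger max appears, so that
    every EXP2 input lies in a narrow non-positive range that an 8-bit
    exponent unit could cover."""
    n = scores.shape[-1]
    m = -np.inf                        # running maximum
    s = 0.0                            # running denominator
    exps = np.zeros(n)
    for start in range(0, n, block):
        blk = scores[start:start + block]
        m_new = max(m, blk.max())
        scale = np.exp(m - m_new)      # rescale factor for earlier blocks
        e = quantize_fp8_e4m3(np.exp(blk - m_new))  # 8-bit exponentials
        s = s * scale + e.sum()
        exps[:start] *= scale          # retroactively rescale prior outputs
        exps[start:start + block] = e
        m = m_new
    return exps / s
```

Because each block is shifted by the running maximum before exponentiation, the quantizer only ever sees values in (0, 1], which is what makes an area-cheap low-precision EXP2 unit plausible; the per-block rescaling keeps the accumulated denominator consistent across blocks.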