🤖 AI Summary
Softmax in Transformers imposes significant performance and energy-efficiency bottlenecks due to its nonlinear exponential computation. This paper addresses the challenge with a low-overhead ISA extension for RISC-V, featuring the first Bfloat16-optimized exponential unit based on Schraudolph's approximation algorithm, implemented with only 1% area overhead within the FPU to enable hardware-software co-acceleration. Coupled with FlashAttention-2 kernel optimizations and a multi-cluster distributed inference framework, the approach reduces Softmax latency 162.7× and energy consumption 74.3×. On GPT-2, FlashAttention-2 achieves 8.2× higher throughput and 4.1× better energy efficiency. End-to-end inference for GPT-2/3 and ViT sees up to 5.8× lower latency and 3.6× lower energy consumption, with negligible accuracy degradation.
📝 Abstract
While Transformers are dominated by Floating-Point (FP) Matrix-Multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores of a compute cluster, through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7$\times$ less latency and 74.3$\times$ less energy compared to the baseline cluster, achieving an 8.2$\times$ performance improvement and 4.1$\times$ higher energy efficiency for the FlashAttention-2 kernel in GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3 and ViT, achieving up to 5.8$\times$ and 3.6$\times$ reduction in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.
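For readers unfamiliar with Schraudolph's method, which the abstract builds on, here is a minimal software sketch of the basic idea for standard float32 (not the paper's Bfloat16 hardware unit): because an IEEE 754 float stores its exponent field as an integer, writing $a \cdot x + b$ directly into the bit pattern (with $a = 2^{23}/\ln 2$ and bias $b = 127 \cdot 2^{23}$) yields a cheap approximation of $e^x$. The function name and the float32 constants below are illustrative assumptions; a Bfloat16 variant would use the 7-bit mantissa ($a = 2^{7}/\ln 2$, $b = 127 \cdot 2^{7}$) instead.

```python
import math
import struct

def schraudolph_exp_f32(x: float) -> float:
    """Approximate exp(x) by writing a*x + b into a float32 bit pattern.

    a = 2^23 / ln(2) scales x so that integer increments of the value
    land in the exponent field; b = 127 * 2^23 supplies the IEEE 754
    exponent bias. The fractional part of x/ln(2) spills into the
    mantissa, giving a linear interpolation between powers of two.
    This plain variant overestimates (relative error up to roughly 6%);
    Schraudolph subtracts a small correction constant from b to center
    the error, omitted here for clarity.
    """
    a = (1 << 23) / math.log(2.0)  # 2^23 / ln(2)
    b = 127 << 23                  # exponent bias shifted into place
    i = int(a * x + b)             # integer bit pattern ~ exp(x)
    # Reinterpret the 32-bit integer as a float32.
    return struct.unpack("<f", struct.pack("<I", i & 0xFFFFFFFF))[0]
```

The appeal for hardware is that the whole operation reduces to one multiply-add and a bit-level reinterpretation, with no iterative evaluation, which is consistent with the small FPU area overhead the paper reports.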