🤖 AI Summary
Softmax in Transformers imposes significant performance and energy-efficiency bottlenecks due to its nonlinear exponential computation. This paper addresses the challenge with a low-overhead ISA extension for RISC-V, featuring the first Bfloat16-optimized exponential unit based on Schraudolph's approximation algorithm, implemented with only 1% area overhead within the FPU to enable hardware-software co-acceleration. Coupled with FlashAttention-2 kernel optimizations and a multi-cluster distributed inference framework, the approach reduces Softmax latency 162.7× and energy consumption 74.3×. On GPT-2, FlashAttention-2 achieves 8.2× higher throughput and 4.1× better energy efficiency. End-to-end inference for GPT-2/3 and ViT sees up to 5.8× lower latency and 3.6× lower energy consumption, with negligible accuracy degradation.
📝 Abstract
While Transformers are dominated by Floating-Point (FP) Matrix-Multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores of a compute cluster, through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7$\times$ less latency and 74.3$\times$ less energy compared to the baseline cluster, achieving an 8.2$\times$ performance improvement and 4.1$\times$ higher energy efficiency for the FlashAttention-2 kernel in GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3 and ViT, achieving up to 5.8$\times$ and 3.6$\times$ reduction in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss.
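For readers unfamiliar with Schraudolph's method, which the abstract builds on, here is a minimal software sketch of the basic idea for standard float32 (not the paper's Bfloat16 hardware unit): because an IEEE 754 float stores its exponent field as an integer, writing $a \cdot x + b$ directly into the bit pattern (with $a = 2^{23}/\ln 2$ and bias $b = 127 \cdot 2^{23}$) yields a cheap approximation of $e^x$. The function name and the float32 constants below are illustrative assumptions; a Bfloat16 variant would use the 7-bit mantissa ($a = 2^{7}/\ln 2$, $b = 127 \cdot 2^{7}$) instead.

```python
import math
import struct

def schraudolph_exp_f32(x: float) -> float:
    """Approximate exp(x) by writing a*x + b into a float32 bit pattern.

    a = 2^23 / ln(2) scales x so that integer increments of the value
    land in the exponent field; b = 127 * 2^23 supplies the IEEE 754
    exponent bias. The fractional part of x/ln(2) spills into the
    mantissa, giving a linear interpolation between powers of two.
    This plain variant overestimates (relative error up to roughly 6%);
    Schraudolph subtracts a small correction constant from b to center
    the error, omitted here for clarity.
    """
    a = (1 << 23) / math.log(2.0)  # 2^23 / ln(2)
    b = 127 << 23                  # exponent bias shifted into place
    i = int(a * x + b)             # integer bit pattern ~ exp(x)
    # Reinterpret the 32-bit integer as a float32.
    return struct.unpack("<f", struct.pack("<I", i & 0xFFFFFFFF))[0]
```

The appeal for hardware is that the whole operation reduces to one multiply-add and a bit-level reinterpretation, with no iterative evaluation, which is consistent with the small FPU area overhead the paper reports.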