🤖 AI Summary
This work addresses the high computational cost of nonlinear activation functions such as GELU in edge-deployed Transformers, which limits both energy efficiency and performance. The authors propose DAPA (Distribution-Aware Piecewise Activation), a differentiable, hardware-friendly piecewise activation function whose design incorporates the pre-activation data distribution. DAPA uses non-uniform segmentation, with finer segments in high-probability regions, to improve approximation fidelity, and it uses a distribution-weighted mean squared error to guide quantization. In the authors' HLS implementation, DAPA matches or surpasses GELU in model accuracy on Vision Transformers and GPT-2 while accelerating activation computation by 16× and reducing DSP utilization by 16×.
📝 Abstract
Non-linear activation functions play a pivotal role in on-device inference and training: they consume substantial hardware resources and significantly affect system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures that exploits the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Squared Error to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16$\times$ and decreases DSP utilization by 16$\times$ while maintaining comparable or better performance across vision Transformers and GPT-2 models.
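
To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a standard normal distribution as a stand-in for measured pre-activation data, places breakpoints so that each segment holds equal probability mass (one simple way to give dense regions finer segments), and uses a density-weighted squared error to choose a fixed-point bit width. The range `[-4, 4]`, the 16 segments, the 10% tolerance, and all function names are illustrative assumptions rather than details from the paper.

```python
import bisect
import math


def gelu(x):
    """Exact GELU, x * Phi(x), computed with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    """Standard normal density: an ASSUMED stand-in for the measured
    pre-activation distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def equal_mass_breakpoints(n_segments, lo, hi, pdf, grid=4001):
    """Split [lo, hi] into segments of (roughly) equal probability mass,
    so dense regions get narrower segments. Illustrative rule only."""
    xs = [lo + (hi - lo) * i / (grid - 1) for i in range(grid)]
    cum = [0.0]
    for i in range(1, grid):  # trapezoid-rule cumulative mass
        step = 0.5 * (pdf(xs[i - 1]) + pdf(xs[i])) * (xs[i] - xs[i - 1])
        cum.append(cum[-1] + step)
    bps, k = [lo], 1
    for i in range(1, grid - 1):
        if k < n_segments and cum[i] >= cum[-1] * k / n_segments:
            bps.append(xs[i])
            k += 1
    bps.append(hi)
    return bps


def make_pwl(bps, vals):
    """Piecewise-linear function through (bps[i], vals[i]); outside the
    table it falls back to GELU's asymptotes, 0 on the left and x on the right."""
    def f(x):
        if x <= bps[0]:
            return 0.0
        if x >= bps[-1]:
            return x
        i = bisect.bisect_right(bps, x) - 1
        t = (x - bps[i]) / (bps[i + 1] - bps[i])
        return vals[i] + t * (vals[i + 1] - vals[i])
    return f


def weighted_mse(approx, pdf, lo, hi, grid=4001):
    """Squared error against GELU, averaged with pdf(x) as the weight."""
    num = den = 0.0
    for i in range(grid):
        x = lo + (hi - lo) * i / (grid - 1)
        w = pdf(x)
        num += w * (approx(x) - gelu(x)) ** 2
        den += w
    return num / den


def quantize(v, frac_bits):
    """Round v to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(v * scale) / scale


LO, HI, N = -4.0, 4.0, 16  # assumed range and segment count
bps = equal_mass_breakpoints(N, LO, HI, normal_pdf)
vals = [gelu(b) for b in bps]
err_float = weighted_mse(make_pwl(bps, vals), normal_pdf, LO, HI)

# Choose the smallest fractional bit width for the table values whose
# weighted error stays within 10% of the unquantized table (assumed tolerance).
chosen_bits = None
for bits in range(1, 17):
    q_vals = [quantize(v, bits) for v in vals]
    err_q = weighted_mse(make_pwl(bps, q_vals), normal_pdf, LO, HI)
    if err_q <= 1.1 * err_float:
        chosen_bits = bits
        break

print("central segment width:", bps[N // 2 + 1] - bps[N // 2])
print("outermost segment width:", bps[-1] - bps[-2])
print("weighted MSE (float table):", err_float)
print("fractional bits chosen:", chosen_bits)
```

Only the table values are quantized here; the breakpoints stay in floating point to keep the sketch short. Pure equal-mass placement leaves the tails very coarse, so a practical design would likely also account for the curvature of the target function when allocating segments.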