DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training

📅 2026-03-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of nonlinear activation functions such as GELU in edge-deployed Transformers, which limits both energy efficiency and performance. The authors propose DAPA, a differentiable, hardware-friendly piecewise activation function whose design incorporates the distribution of pre-activation data. DAPA uses non-uniform segmentation, placing finer segments in high-probability regions to improve approximation fidelity, and quantizes the resulting approximation under a distribution-weighted mean squared error. In the authors' HLS implementation, DAPA speeds up GELU computation by 16× and reduces DSP utilization by 16×, while maintaining comparable or better performance than GELU on vision Transformers and GPT-2.
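The non-uniform segmentation described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes synthetic standard-normal pre-activations, places breakpoints at equally spaced quantiles of those samples, and interpolates GELU linearly between them. The function names (`quantile_breakpoints`, `piecewise_gelu`), the clipping range of ±6, and the choice of 16 segments are all hypothetical choices made for illustration.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    x = np.asarray(x, dtype=np.float64)
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def quantile_breakpoints(samples, num_segments, lo=-6.0, hi=6.0):
    # Interior breakpoints sit at equally spaced quantiles of the data,
    # so segments are narrow where pre-activations are dense.
    qs = np.linspace(0.0, 1.0, num_segments + 1)[1:-1]
    inner = np.quantile(np.clip(samples, lo, hi), qs)
    return np.unique(np.concatenate(([lo], inner, [hi])))

def piecewise_gelu(x, knots):
    # Linear interpolation between knots; outside the knot range fall back
    # to the GELU asymptotes (0 on the left, identity on the right).
    x = np.asarray(x, dtype=np.float64)
    y = np.interp(x, knots, gelu(knots))
    y = np.where(x < knots[0], 0.0, y)
    return np.where(x > knots[-1], x, y)

# Assumption: pre-activations are roughly standard normal.
rng = np.random.default_rng(0)
samples = rng.normal(size=50_000)

knots = quantile_breakpoints(samples, num_segments=16)
# Averaging over the samples weights the error by the data distribution.
mse = float(np.mean((piecewise_gelu(samples, knots) - gelu(samples)) ** 2))

print("segment widths:", np.round(np.diff(knots), 3))
print("sample-weighted MSE:", mse)
```

Printing the segment widths shows narrow segments near zero and wide ones in the tails. Pure quantile placement leaves the tail segments coarse, so a practical design would likely need an additional rule for those regions; this sketch does not attempt one.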

📝 Abstract

Non-linear activation functions play a pivotal role in on-device inference and training, as they not only consume substantial hardware resources but also significantly affect system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures that exploits the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Square Error to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16× and decreases DSP utilization by 16× while maintaining comparable or better performance across vision Transformers and GPT-2 models.
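One way to read the quantization step is as a search for the narrowest fixed-point format whose distribution-weighted error stays close to that of the unquantized table. The sketch below illustrates that idea only. The abstract does not spell out the procedure, so the uniform 16-segment table, the standard-normal weighting, the 10% tolerance, and the helper names (`to_fixed_point`, `weighted_mse`, `table_output`) are assumptions, and only the stored table values are quantized, not the inputs.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    x = np.asarray(x, dtype=np.float64)
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def to_fixed_point(values, frac_bits):
    # Round to the nearest multiple of 2**-frac_bits (saturation not modeled).
    scale = 2.0 ** frac_bits
    return np.round(np.asarray(values) * scale) / scale

def weighted_mse(approx, exact, weights):
    # Squared error averaged under normalized weights, so errors in
    # high-probability regions count more than errors in the tails.
    w = weights / weights.sum()
    return float(np.sum(w * (approx - exact) ** 2))

# Evaluation grid, weighted by an assumed standard normal density.
grid = np.linspace(-4.0, 4.0, 2001)
density = np.exp(-0.5 * grid ** 2)
exact = gelu(grid)

# Hypothetical 16-segment lookup table (uniform knots, for brevity).
knots = np.linspace(-4.0, 4.0, 17)

def table_output(frac_bits):
    # Piecewise linear output when table entries are stored in fixed point.
    return np.interp(grid, knots, to_fixed_point(gelu(knots), frac_bits))

float_error = weighted_mse(np.interp(grid, knots, gelu(knots)), exact, density)
errors = {bits: weighted_mse(table_output(bits), exact, density)
          for bits in range(2, 13)}

# Smallest fractional bit width within 10% of the unquantized weighted error.
chosen_bits = min(b for b, e in errors.items() if e <= 1.1 * float_error)
print("unquantized weighted MSE:", float_error)
print("chosen fractional bits:", chosen_bits)
```

Swapping `density` for a flat weight vector turns the criterion into a plain MSE, which makes it easy to compare how the two criteria rank the same bit widths.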

Problem

Research questions and friction points this paper is trying to address.

on-device inference

activation functions

hardware efficiency

Transformer models

non-linear activation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution-Aware

Piecewise Activation

Hardware-Friendly