DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training

📅 2026-03-19
🤖 AI Summary
This work targets the high computational cost of nonlinear activation functions such as GELU in edge-deployed Transformers, where they consume substantial hardware resources and degrade both performance and energy efficiency. The authors propose DAPA, a differentiable, hardware-friendly piecewise activation function whose design is driven by the distribution of pre-activation data. DAPA uses a non-uniform piecewise approximation that places finer segments in high-probability regions of that distribution, and it quantizes the resulting approximation under a Distribution-Weighted Mean Square Error criterion to reduce latency and resource utilization. In the authors' HLS implementation, DAPA speeds up GELU computation by 16× and reduces DSP utilization by 16×, while maintaining comparable or better performance than GELU on vision Transformers and GPT-2 models.
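
The idea of spending more segments where pre-activations are likely to fall can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a standard normal pre-activation distribution, a clipping range of [-4, 4], and equal-probability quantiles as the breakpoint rule, and the function names are hypothetical.

```python
import bisect
import math
from statistics import NormalDist


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def quantile_breakpoints(dist: NormalDist, segments: int, lo: float, hi: float) -> list[float]:
    """Knots at equal-probability quantiles of `dist` inside [lo, hi].

    Assumption for illustration: each segment carries the same probability
    mass, so segments are narrow where the density is high.
    """
    p_lo, p_hi = dist.cdf(lo), dist.cdf(hi)
    inner = [dist.inv_cdf(p_lo + (p_hi - p_lo) * k / segments) for k in range(1, segments)]
    return [lo] + inner + [hi]


def piecewise_gelu(x: float, knots: list[float]) -> float:
    """Linear interpolation of GELU between knots.

    Outside the knot range the GELU asymptotes are used (0 on the left,
    x on the right), which leaves a very small jump at the two end knots.
    """
    if x <= knots[0]:
        return 0.0
    if x >= knots[-1]:
        return x
    i = bisect.bisect_right(knots, x) - 1
    x0, x1 = knots[i], knots[i + 1]
    y0, y1 = gelu(x0), gelu(x1)
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


if __name__ == "__main__":
    knots = quantile_breakpoints(NormalDist(0.0, 1.0), segments=16, lo=-4.0, hi=4.0)
    print("central segment width:", knots[9] - knots[8])
    print("tail segment width:   ", knots[1] - knots[0])
    print("error at x = 0.1:     ", abs(piecewise_gelu(0.1, knots) - gelu(0.1)))
```

With these assumptions the segments near zero are much narrower than those in the tails. Equal-probability placement is only one possible rule; the summary does not specify how DAPA chooses its breakpoints.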

📝 Abstract
Non-linear activation functions play a pivotal role in on-device inference and training, as they not only consume substantial hardware resources but also impose a significant impact on system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures by exploiting the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Square Error to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16× and decreases DSP utilization by 16× while maintaining comparable or better performance across vision Transformers and GPT-2 models.
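
The abstract does not define Distribution-Weighted Mean Square Error, so the following is only one plausible reading: the squared approximation error averaged under the pre-activation density, used here to compare fixed-point precisions for a single linear segment. The standard normal density, the segment [0, 0.25], the fixed-point rounding rule, and all function names are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def to_fixed_point(value, frac_bits):
    """Round to the nearest multiple of 2**-frac_bits (hypothetical format)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def weighted_mse(approx, reference, pdf, xs):
    """Discrete estimate of E_p[(approx(x) - reference(x))**2] on grid xs,
    with weights pdf(x) normalized to sum to one."""
    weights = [pdf(x) for x in xs]
    total = sum(weights)
    return sum(w * (approx(x) - reference(x)) ** 2 for w, x in zip(weights, xs)) / total


def quantized_segment(x0, x1, frac_bits):
    """Chord of GELU over [x0, x1] with slope and intercept rounded to fixed point."""
    slope = (gelu(x1) - gelu(x0)) / (x1 - x0)
    intercept = gelu(x0) - slope * x0
    q_slope = to_fixed_point(slope, frac_bits)
    q_intercept = to_fixed_point(intercept, frac_bits)
    return lambda x: q_slope * x + q_intercept


if __name__ == "__main__":
    pdf = NormalDist(0.0, 1.0).pdf          # assumed pre-activation density
    xs = [i * 0.25 / 50 for i in range(51)]  # evaluation grid on [0, 0.25]
    for bits in (2, 4, 8, 12):
        err = weighted_mse(quantized_segment(0.0, 0.25, bits), gelu, pdf, xs)
        print(bits, "fractional bits -> weighted MSE", err)
```

A search over `frac_bits` like the loop above could then pick the smallest precision whose weighted error stays under a chosen tolerance, which is the kind of accuracy versus resource trade-off the abstract describes.
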
Problem

Research questions and friction points this paper is trying to address.

on-device inference
activation functions
hardware efficiency
Transformer models
non-linear activation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution-Aware
Piecewise Activation
Hardware-Friendly
Non-Uniform Approximation
On-Device Transformer