🤖 AI Summary
This work addresses the challenge of implementing the Softmax function efficiently on resource-constrained low-end FPGAs. We systematically investigate hardware-friendly approximations—namely Taylor series expansion, Padé rational approximation, and lookup-table (LUT)-based interpolation (including quadratic interpolation)—within a unified framework. We quantitatively compare these methods using root-mean-square error (RMSE) analysis and physical synthesis results, revealing their intrinsic accuracy–latency–power trade-offs. Experimental evaluation shows that quadratic LUT interpolation achieves the highest accuracy (lowest RMSE), while Taylor and Padé approximations yield the lowest latency. All approximations significantly outperform floating-point implementations: latency is reduced by over 2×, and FPGA resource utilization decreases by more than 40%. The proposed designs are fully configurable in precision and seamlessly integrate into edge-deployed deep neural network (DNN) accelerators. This work delivers a practical, portable, and area-efficient Softmax hardware acceleration solution tailored for ultra-low-power edge AI applications.
📝 Abstract
The softmax function is an activation function placed in the output layer of a neural network. It extracts the probabilities of the output classes while introducing a non-linearity into the model. On low-end FPGAs, implementations of Deep Neural Networks (DNNs) require the exploration of optimisation techniques to improve computational efficiency and hardware resource consumption. This work explores approximate computing techniques to implement the softmax function, using Taylor and Padé approximations, and interpolation methods with Look-Up Tables (LUTs). The introduction of approximations aims to reduce execution time at the cost of some precision in the results produced by the softmax function. Each implementation is evaluated using the Root Mean Square Error (RMSE) for accuracy assessment, and performance is verified through execution-time measurements. From our evaluation, quadratic interpolation with LUTs achieves the lowest error, but in terms of performance, Taylor and Padé approximations show better execution times, which highlights the inherent design trade-off between numerical accuracy and execution efficiency.
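To make the accuracy trade-off concrete, here is a minimal software sketch (not the paper's hardware design) comparing two of the discussed approximations against an exact softmax: a truncated Taylor series for the exponential, and a LUT with linear interpolation over a precomputed grid. The table size (64 entries), input range [-8, 0], and Taylor order (4) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax_ref(x):
    # Reference softmax, numerically stabilised by subtracting the max.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def exp_taylor(x, order=4):
    # Truncated Taylor series of exp around 0: sum_{k=0}^{order} x^k / k!
    result = np.ones_like(x)
    term = np.ones_like(x)
    for k in range(1, order + 1):
        term = term * x / k
        result = result + term
    return result

def softmax_taylor(x, order=4):
    x = x - np.max(x)           # shift so all inputs are <= 0
    e = exp_taylor(x, order)
    e = np.maximum(e, 0.0)      # truncated series can go negative for large |x|
    return e / e.sum()

def softmax_lut(x, n_entries=64, lo=-8.0, hi=0.0):
    # exp approximated by linear interpolation over a precomputed table
    # (on an FPGA the table would live in block RAM).
    x = np.clip(x - np.max(x), lo, hi)
    grid = np.linspace(lo, hi, n_entries)
    table = np.exp(grid)
    e = np.interp(x, grid, table)
    return e / e.sum()

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

logits = np.array([2.0, 1.0, 0.5, -1.0])
ref = softmax_ref(logits)
print("Taylor RMSE:", rmse(softmax_taylor(logits), ref))
print("LUT RMSE:   ", rmse(softmax_lut(logits), ref))
```

On inputs with a wide dynamic range, the low-order Taylor series degrades quickly away from the expansion point, while the LUT stays accurate across its whole covered interval, mirroring the paper's finding that LUT interpolation yields the lowest RMSE while the series approximations trade accuracy for speed.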