🤖 AI Summary
To address performance and resource bottlenecks in quantized neural network (QNN) inference on edge devices—caused by model complexity—this paper proposes a hardware-efficient, unstructured sparsity-aware acceleration framework for FPGAs, requiring no dedicated sparse execution units. The method features: (i) a hardware-aware fine-grained pruning strategy co-optimized with quantization and dataflow architecture; and (ii) a restructured sparse data layout and memory access pattern that eliminates irregular memory accesses while preserving high parallelism. Evaluated on LeNet-5, the framework achieves 51.6× model compression and 1.23× throughput improvement, consuming only 5.12% of LUT resources. It significantly enhances energy efficiency and hardware utilization. This work establishes a lightweight, general-purpose, and hardware-friendly paradigm for deploying QNNs under severe resource constraints.
📝 Abstract
FPGAs have been shown to be a promising platform for deploying Quantised Neural Networks (QNNs) with high-speed, low-latency, and energy-efficient inference. However, the complexity of modern deep-learning models limits the performance on resource-constrained edge devices. While quantisation and pruning alleviate these challenges, unstructured sparsity remains underexploited due to irregular memory access. This work introduces a framework that embeds unstructured sparsity into dataflow accelerators, eliminating the need for dedicated sparse engines and preserving parallelism. A hardware-aware pruning strategy is introduced to improve efficiency and design flow further. On LeNet-5, the framework attains 51.6 x compression and 1.23 x throughput improvement using only 5.12% of LUTs, effectively exploiting unstructured sparsity for QNN acceleration.