HaShiFlex: A High-Throughput Hardened Shifter DNN Accelerator with Fine-Tuning Flexibility

2025-12-14
AI Summary
To address data-movement bottlenecks and memory pressure in continuous sensing at the edge and in datacenter inference, this work proposes a high-throughput, energy-efficient DNN accelerator that still supports post-deployment fine-tuning of the classification layer. It combines two ideas: (1) power-of-two (Po2) weight quantization, which replaces each multiplication with a fixed bit shift (plain rewiring in hardwired logic) and so reduces convolutions to additions; and (2) an ASIC design that hardwires most layers while keeping the final layer reprogrammable on a small neural processing unit, balancing peak throughput against adaptability. Implemented in a 7nm ASIC flow and evaluated on MobileNetV2, the accelerator processes 1.21M images/s, 20× faster than a GPU, while retaining fine-tuning flexibility. If no post-deployment fine-tuning is needed, throughput rises to 4M images/s (67× the GPU). The paper also reports area, accuracy, and sensitivity to quantization and pruning.

Abstract
We introduce a high-throughput neural network accelerator that embeds most network layers directly in hardware, minimizing data transfer and memory usage while preserving a degree of flexibility via a small neural processing unit for the final classification layer. By leveraging power-of-two (Po2) quantization for weights, we replace multiplications with simple rewiring, effectively reducing each convolution to a series of additions. This streamlined approach offers high-throughput, energy-efficient processing, making it highly suitable for applications where model parameters remain stable, such as continuous sensing tasks at the edge or large-scale data center deployments. Furthermore, by including a strategically chosen reprogrammable final layer, our design achieves high throughput without sacrificing fine-tuning capabilities. We implement this accelerator in a 7nm ASIC flow using MobileNetV2 as a baseline and report throughput, area, accuracy, and sensitivity to quantization and pruning, demonstrating both the advantages and potential trade-offs of the proposed architecture. We find that for MobileNetV2, we can improve inference throughput by 20x over fully programmable GPUs, processing 1.21 million images per second through a full forward pass while retaining fine-tuning flexibility. If absolutely no post-deployment fine-tuning is required, this advantage increases to 67x at 4 million images per second.
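As a quick sanity check on the reported figures, the sketch below divides each throughput by its speedup factor to recover the GPU baseline each figure implies. The function name `implied_gpu_baseline` is hypothetical, and because the speedups are quoted as rounded values (20x, 67x), the results are only approximate and are not numbers stated in the abstract.

```python
def implied_gpu_baseline(images_per_s: float, speedup: float) -> float:
    """GPU throughput implied by an accelerator throughput and its speedup."""
    return images_per_s / speedup


# Figures quoted in the abstract; the speedup factors are rounded there.
with_finetuning = implied_gpu_baseline(1.21e6, 20)
without_finetuning = implied_gpu_baseline(4.0e6, 67)

# Relative gap between the two implied baselines.
relative_gap = abs(with_finetuning - without_finetuning) / with_finetuning

print(f"implied GPU baseline (20x claim): {with_finetuning:,.0f} images/s")
print(f"implied GPU baseline (67x claim): {without_finetuning:,.0f} images/s")
print(f"relative gap: {relative_gap:.1%}")
```

Both claims point to a GPU baseline of roughly 60 thousand images per second, so the two speedup figures are consistent with a single reference measurement.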
Problem

Research questions and friction points this paper is trying to address.

Designs a high-throughput DNN accelerator with hardware-embedded layers
Uses Po2 quantization to replace multiplications with additions for efficiency
Maintains fine-tuning flexibility via a reprogrammable final classification layer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardware-embedded layers minimize data transfer and memory usage
Power-of-two quantization replaces multiplications with rewiring and additions
Reprogrammable final layer maintains fine-tuning flexibility with high throughput
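The Po2 idea above can be illustrated in software. The following is a minimal sketch, not the paper's hardware design: it assumes integer activations, weights rounded to the nearest signed power of two in the log domain, and exponents clamped to the range [-7, 0]. The names `FRAC_BITS`, `quantize_po2`, and `dot_shift_add` are hypothetical. In hardwired logic each shift amount is a constant, so the shift costs no gates and amounts to wiring; only the additions remain.

```python
import math

FRAC_BITS = 7  # assumption: the smallest weight magnitude is 2**-7


def quantize_po2(w: float) -> tuple[int, int]:
    """Round a weight to the nearest signed power of two (log domain).

    Returns (sign, exponent); sign 0 marks a zero (pruned) weight.
    """
    if w == 0.0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(-FRAC_BITS, min(0, exp))  # clamp to the assumed exponent range
    return sign, exp


def dot_shift_add(activations: list[int], po2_weights: list[tuple[int, int]]) -> int:
    """Dot product using only shifts, additions, and subtractions.

    The result is in fixed point, scaled by 2**FRAC_BITS.
    """
    acc = 0
    for x, (sign, exp) in zip(activations, po2_weights):
        term = x << (exp + FRAC_BITS)  # shift amount is always >= 0
        if sign > 0:
            acc += term
        elif sign < 0:
            acc -= term
    return acc


weights = [0.5, -0.25, 1.0, 0.0]
activations = [4, 8, 3, 5]
po2 = [quantize_po2(w) for w in weights]
result = dot_shift_add(activations, po2) / (1 << FRAC_BITS)
reference = sum(a * w for a, w in zip(activations, weights))
```

Here the weights are already exact powers of two, so `result` matches `reference`. A weight such as 0.3 would be rounded to 0.25, which is the source of the quantization sensitivity the paper reports on.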