AI Summary
To address the high computational/storage overhead and energy consumption of deep learning inference on hardware, this paper proposes StruM, a structured mixed-precision quantization method co-designed with a dedicated data-processing unit (DPU). StruM leverages weight magnitude distributions to perform structured block-wise dual-precision quantization within each layer, requiring no retraining or specialized hardware support. For CNN models, it effectively compresses 8-bit weights to an equivalent 4-bit representation with negligible accuracy degradation. At the hardware level, PE power consumption is reduced by 31-34% and total accelerator area decreases by 10%, significantly improving inference energy efficiency; under static configuration, StruM further achieves a 23-26% reduction in processing-element (PE) area and a 2-3% reduction in overall DPU area. The approach is fully compatible with the FlexNN architecture and natively supports 4-/8-bit low-bit integer arithmetic.
Abstract
In this paper, we propose StruM, a novel structured mixed-precision deep learning inference method co-designed with its associated hardware accelerator (DPU), to address the escalating computational and memory demands of deep learning workloads in data centers and edge applications. Diverging from traditional approaches, our method avoids time-consuming re-training/fine-tuning and the need for specialized hardware support. By leveraging the variance in weight magnitudes within layers, we quantize values within blocks to two different levels, reducing 8-bit integer weights to 4-bit values (up to a 50% reduction in precision) across various Convolutional Neural Networks (CNNs) with negligible loss in inference accuracy. To demonstrate the efficiency gains from mixed precision, we implement StruM on top of our in-house FlexNN DNN accelerator [1], which supports low- and mixed-precision execution. Experimental results show that the proposed StruM-based hardware architecture achieves a 31-34% reduction in processing element (PE) power consumption and a 10% reduction in area at the accelerator level. In addition, the statically configured StruM yields a 23-26% area reduction at the PE level and 2-3% area savings at the DPU level.
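To make the block-wise dual-precision idea concrete, the following is a minimal NumPy sketch, not StruM's actual algorithm: each fixed-size block of a layer's weights is assigned either 4-bit or 8-bit quantization based on its peak magnitude relative to the layer maximum. The `block_size` and `threshold` parameters, and the magnitude-based selection rule itself, are illustrative assumptions; the paper does not specify these details here.

```python
import numpy as np

def quantize_blocks(weights, block_size=16, threshold=0.5):
    """Illustrative block-wise dual-precision quantization (not StruM's
    exact criterion). Blocks whose peak magnitude is small relative to
    the layer maximum are quantized to 4 bits; the rest keep 8 bits."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    layer_max = np.abs(w).max()
    out = np.empty_like(w)
    bits_per_block = []
    for start in range(0, len(w), block_size):
        block = w[start:start + block_size]
        # Assumption: low-magnitude blocks tolerate coarser (4-bit) levels.
        bits = 4 if np.abs(block).max() < threshold * layer_max else 8
        levels = 2 ** (bits - 1) - 1          # symmetric signed range
        scale = layer_max / levels
        out[start:start + block_size] = np.round(block / scale) * scale
        bits_per_block.append(bits)
    return out.reshape(np.shape(weights)), bits_per_block
```

In this toy scheme, the per-block bit choices form the "structured" metadata a mixed-precision PE would consume to select a 4-bit or 8-bit datapath per block.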