🤖 AI Summary
In quantized neural network inference, conventional dot-product accumulation requires 32-bit accumulators to prevent overflow, which increases memory bandwidth usage and reduces energy efficiency. This work proposes PQS, the first framework to jointly combine N:M pruning in floating point, low-bit (≤8-bit) weight/activation quantization, and a "small-to-large" partial-sum ordering strategy in an algorithm–hardware co-design that eliminates accumulator overflow at inference time. Because pruning keeps accumulation chains short, PQS needs only a 12-bit accumulator, a 2.5× reduction that removes the need for 32-bit accumulators. Evaluated on multiple image classification benchmarks, PQS matches floating-point baseline accuracy while improving memory bandwidth usage and energy efficiency, making the design hardware-friendly and practical to deploy.
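To make the ordering idea concrete, here is a toy Python sketch (an illustration, not the paper's implementation; the product values and the 12-bit width are chosen for demonstration). A sum that fits comfortably in a 12-bit signed accumulator can still overflow mid-accumulation, and adding the partial products small-to-large by magnitude avoids the overflow in this example:

```python
ACC_BITS = 12  # signed range: -2048 .. 2047

def accumulate(products, acc_bits=ACC_BITS):
    """Accumulate integer partial products, flagging any intermediate overflow."""
    lo, hi = -(1 << (acc_bits - 1)), (1 << (acc_bits - 1)) - 1
    acc, overflowed = 0, False
    for p in products:
        acc += int(p)
        if acc < lo or acc > hi:
            overflowed = True  # a real narrow accumulator would wrap or saturate here
    return acc, overflowed

# The final sum (100) fits in 12 bits, but the running sum does not in this order.
products = [1200, 1200, -1200, -1100]

print(accumulate(products))                   # (100, True)  -> overflow mid-stream
print(accumulate(sorted(products, key=abs)))  # (100, False) -> small-to-large stays in range
```

In PQS this reordering works together with pruning and quantization, which keep dot products short and partial products small enough for a narrow accumulator to suffice.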
📝 Abstract
We present PQS, which uses three techniques together (Prune, Quantize, and Sort) to achieve low-bitwidth accumulation of dot products in neural network computations. In conventional quantized (e.g., 8-bit) dot products, partial results are accumulated into wide (e.g., 32-bit) accumulators to avoid overflow of intermediate partial sums. However, such wide accumulators increase memory bandwidth usage and reduce energy efficiency. We show that iterative N:M pruning in floating point, followed by quantization to 8 (or fewer) bits and accumulation of partial products in a sorted order ("small to large"), yields accurate, compressed models with dot products short enough that wide accumulators are unnecessary. We design, analyze, and implement the PQS algorithm to eliminate accumulation overflows at inference time for several neural networks. Our method offers a 2.5× reduction in accumulator bitwidth while achieving model accuracy on par with floating-point baselines on multiple image classification tasks.
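As a rough sketch of the Prune and Quantize steps, the snippet below applies a single N:M magnitude-pruning pass followed by symmetric per-tensor int8 quantization (the paper prunes iteratively during floating-point training; the 2:4 pattern and function names here are illustrative assumptions):

```python
import numpy as np

def nm_prune(w, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m consecutive
    weights (N:M structured sparsity); zero out the rest."""
    groups = w.reshape(-1, m).copy()
    # Column indices of the (m - n) smallest-magnitude weights per group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

def quantize_int8(w):
    """Symmetric per-tensor quantization of float weights to int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)
wq, scale = quantize_int8(nm_prune(w, n=2, m=4))
# Each group of 4 keeps only its 2 largest-magnitude weights.
print(wq.reshape(4, 4))
```

After pruning, each dot product touches only N of every M weights, which is what makes the short accumulation chains and the sorted, narrow-accumulator evaluation sketched above possible.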