🤖 AI Summary
Existing microscaling (MX) multiply-accumulate (MAC) designs face a fundamental trade-off between integer and floating-point accumulation: integer accumulation incurs costly floating-point-to-integer conversion overhead, while FP32 accumulation introduces quantization errors and high-cost normalization. This work targets continual-learning neural processing units (NPUs) and proposes a precision-scalable MX datapath. We introduce the first hybrid-precision reduction-tree MAC architecture that jointly preserves integer arithmetic accuracy and exploits floating-point dynamic range. The design natively supports narrow bit-width formats, including MXINT8, MXFP8/6, and MXFP4, and enables training-inference co-optimization via configurable accumulation strategies. Integrated into the SNAX NPU platform, the system achieves a measured energy efficiency of 657–4065 GOPS/W and a peak throughput of 512 GOPS, significantly outperforming state-of-the-art designs.
📝 Abstract
Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms that support both training and inference operations. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate (MAC) designs face a critical trade-off: integer accumulation requires expensive conversions from narrow floating-point products, while FP32 accumulation suffers from quantization losses and costly normalization. To address these limitations, we propose a hybrid precision-scalable reduction tree for MX MACs that combines the benefits of both approaches, enabling efficient mixed-precision accumulation with controlled accuracy relaxation. Moreover, we integrate an 8×8 array of these MACs into the state-of-the-art (SotA) NPU integration platform, SNAX, to provide efficient control and data transfer for our optimized precision-scalable MX datapath. We evaluate our design at both the MAC and system levels and compare it to the SotA. Our integrated system achieves energy efficiencies of 657, 1438–1675, and 4065 GOPS/W for MXINT8, MXFP8/6, and MXFP4, respectively, with throughputs of 64, 256, and 512 GOPS.
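For readers unfamiliar with the MX formats the abstract refers to, the key idea is that a block of elements shares one power-of-two scale while each element is stored in a narrow format (e.g., INT8 for MXINT8). The sketch below is a simplified, NumPy-based illustration of that block structure, not a spec-exact implementation of the OCP MX standard or of this paper's hardware; the function names, the block contents, and the exponent offset of 6 (so that the block maximum lands in the INT8 range) are our own illustrative choices.

```python
import numpy as np

def mx_int8_quantize(block):
    """Quantize a block of floats in MXINT8 style: one shared
    power-of-two scale for the whole block, plus one INT8 value
    per element. Simplified sketch, not the OCP MX spec verbatim."""
    amax = np.max(np.abs(block))
    if amax == 0:
        return 0, np.zeros_like(block, dtype=np.int8)
    # Shared scale: choose a power of two so the block maximum
    # fits in the signed 8-bit range (|q| <= 127 ~ 2^7).
    shared_exp = int(np.floor(np.log2(amax))) - 6
    scale = 2.0 ** shared_exp
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return shared_exp, q

def mx_int8_dequantize(shared_exp, q):
    """Recover approximate float values from the shared exponent
    and the per-element INT8 payloads."""
    return q.astype(np.float32) * np.float32(2.0 ** shared_exp)
```

Because every element in a block shares the same scale, the per-element multiplies inside one block reduce to cheap integer arithmetic, which is exactly why integer accumulation is attractive; the conversion cost arises when differently scaled blocks (or narrow floating-point products) must be aligned before accumulation, the trade-off the abstract describes.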