AI Summary
This work addresses the inefficiency of existing digital compute-in-memory (DCIM) architectures in supporting diverse FP8 formats due to their fixed-precision MAC units and uniform alignment strategies. To overcome this limitation, the authors propose a flexible FP8 DCIM accelerator that dynamically predicts input distributions and adaptively adjusts the alignment precision of weights and inputs, achieving significant energy efficiency gains without compromising model accuracy. Key innovations include Dynamic Shift-aware Bitwidth Prediction (DSBP), a FIFO-pointer-based Input Alignment Unit (FIAU), and a precision-scalable integer MAC array. Implemented in 28nm CMOS, the 64×96 CIM chip achieves 20.4 TFLOPS/W under the E5M7 format (2.8× higher than prior art) and demonstrates superior efficiency at iso-accuracy on Llama-7B across the BoolQ and Winogrande benchmarks, enabling flexible trade-offs between precision and energy efficiency.
Abstract
FP8 low-precision formats have gained significant adoption in Transformer inference and training. However, existing digital compute-in-memory (DCIM) architectures face challenges in supporting variable FP8 aligned-mantissa bitwidths, as unified alignment strategies and fixed-precision multiply-accumulate (MAC) units struggle to handle input data with diverse distributions. This work presents a flexible FP8 DCIM accelerator with three innovations: (1) a dynamic shift-aware bitwidth prediction (DSBP) scheme with on-the-fly input prediction that adaptively adjusts weight (2/4/6/8b) and input (2$\sim$12b) aligned-mantissa precision; (2) a FIFO-based input alignment unit (FIAU) replacing complex barrel shifters with pointer-based control; and (3) a precision-scalable INT MAC array achieving flexible weight precision with minimal overhead. Implemented in 28nm CMOS with a 64$\times$96 CIM array, the design achieves 20.4 TFLOPS/W for fixed E5M7, demonstrating 2.8$\times$ higher FP8 efficiency than previous work while supporting all FP8 formats. Results on Llama-7B show that the DSBP achieves higher efficiency than the fixed-bitwidth mode at the same accuracy level on both the BoolQ and Winogrande datasets, with configurable parameters enabling flexible accuracy-efficiency trade-offs.
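As background on the mechanism the abstract describes, namely aligning FP8 mantissas to a shared exponent so that products can be accumulated in an integer MAC, with the aligned-mantissa bitwidth as a tunable knob, here is a minimal Python sketch. It illustrates the general technique, not the paper's hardware: the function names, the E4M3 default layout, and the truncation model are assumptions made for illustration.

```python
# Minimal sketch of FP8 mantissa alignment feeding an integer MAC.
# Illustrative only; not the paper's RTL or its exact datapath.

def decode_fp8(bits, exp_bits=4, man_bits=3):
    """Split an FP8 word (default E4M3-style layout) into sign,
    effective exponent, and mantissa with the hidden leading 1."""
    sign = (bits >> (exp_bits + man_bits)) & 1
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp != 0:
        man |= 1 << man_bits          # normals carry a hidden leading 1
    return sign, max(exp, 1), man     # subnormals use effective exponent 1

def aligned_dot(x_words, w_words, align_bits=12, exp_bits=4, man_bits=3):
    """Dot product: right-shift each input mantissa to the block-max
    exponent, keep only `align_bits` bits (the lossy step), then
    multiply-accumulate in plain integers."""
    bias = (1 << (exp_bits - 1)) - 1
    xs = [decode_fp8(x, exp_bits, man_bits) for x in x_words]
    ws = [decode_fp8(w, exp_bits, man_bits) for w in w_words]
    e_max = max(e for _, e, _ in xs)
    head = align_bits - (man_bits + 1)    # headroom before bits fall off
    acc = 0
    for (xsgn, xe, xm), (wsgn, we, wm) in zip(xs, ws):
        # Align to e_max; truncation here models the accuracy cost
        # of choosing a small aligned-mantissa bitwidth.
        m = (xm << head) >> (e_max - xe)
        prod = m * wm                     # integer multiply
        if xsgn ^ wsgn:
            prod = -prod
        acc += prod * (1 << we)           # fold in the weight exponent
    # Undo the fixed-point scaling to recover a real-valued result.
    return acc * 2.0 ** (e_max - head - 2 * bias - 2 * man_bits)
```

With `align_bits` large enough to absorb the worst-case exponent gap the result is exact; shrinking it toward the low end of the 2$\sim$12b input range narrows the integer MAC at the cost of truncation error, which is the kind of accuracy-efficiency trade-off the abstract describes DSBP tuning at run time per input distribution.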