Balancing FP8 Computation Accuracy and Efficiency on Digital CIM via Shift-Aware On-the-fly Aligned-Mantissa Bitwidth Prediction

📅 2026-02-05
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inefficiency of existing digital compute-in-memory (DCIM) architectures in supporting diverse FP8 formats due to their fixed-precision MAC units and uniform alignment strategies. To overcome this limitation, the authors propose a flexible FP8 DCIM accelerator that dynamically predicts input distributions and adaptively adjusts the alignment precision of weights and inputs, achieving significant energy efficiency gains without compromising model accuracy. Key innovations include Dynamic Shift-aware Bitwidth Prediction (DSBP), a FIFO-pointer-based Input Alignment Unit (FIAU), and a precision-scalable integer MAC array. Implemented in 28nm CMOS, the 64×96 CIM chip achieves 20.4 TFLOPS/W under the E5M7 format (2.8× higher than prior art) and demonstrates superior efficiency at iso-accuracy on Llama-7B across BoolQ and Winogrande benchmarks, enabling flexible trade-offs between precision and energy efficiency.
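The DSBP idea can be illustrated with a small software model (a hedged sketch: `predict_width`, the even-width ladder, and the coverage threshold are illustrative assumptions, not the paper's circuit). Within a block, each mantissa must be right-shifted by its exponent's distance from the block maximum; the predictor then picks the smallest aligned-mantissa width that keeps most values from being flushed to zero.

```python
# Hedged software model of dynamic shift-aware bitwidth prediction (DSBP).
# The function name, width ladder, and coverage threshold are illustrative
# assumptions; the paper implements this prediction on-the-fly in hardware.

def predict_width(exponents, widths=(2, 4, 6, 8, 10, 12), coverage=0.99):
    """Pick the smallest aligned-mantissa width that keeps at least
    `coverage` of the block's inputs nonzero after exponent alignment."""
    e_max = max(exponents)
    shifts = [e_max - e for e in exponents]      # right-shift needed per element
    n = len(shifts)
    for w in widths:
        kept = sum(1 for s in shifts if s < w)   # shift >= w flushes the value to 0
        if kept / n >= coverage:
            return w
    return widths[-1]                            # fall back to the widest mode

# A block with one small outlier only needs a narrow alignment window:
print(predict_width([10, 10, 9, 8, 7, 3], coverage=0.8))  # -> 4
```

A tighter coverage target pushes the predictor toward wider (more accurate, less efficient) modes, which is the configurable accuracy-efficiency knob the abstract describes.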

πŸ“ Abstract
FP8 low-precision formats have gained significant adoption in Transformer inference and training. However, existing digital compute-in-memory (DCIM) architectures face challenges in supporting variable FP8 aligned-mantissa bitwidths, as unified alignment strategies and fixed-precision multiply-accumulate (MAC) units struggle to handle input data with diverse distributions. This work presents a flexible FP8 DCIM accelerator with three innovations: (1) a dynamic shift-aware bitwidth prediction (DSBP) with on-the-fly input prediction that adaptively adjusts weight (2/4/6/8b) and input (2$\sim$12b) aligned-mantissa precision; (2) a FIFO-based input alignment unit (FIAU) replacing complex barrel shifters with pointer-based control; and (3) a precision-scalable INT MAC array achieving flexible weight precision with minimal overhead. Implemented in 28nm CMOS with a 64$\times$96 CIM array, the design achieves 20.4 TFLOPS/W for fixed E5M7, demonstrating 2.8$\times$ higher FP8 efficiency than previous work while supporting all FP8 formats. Results on Llama-7b show that the DSBP achieves higher efficiency than fixed bitwidth mode at the same accuracy level on both BoolQ and Winogrande datasets, with configurable parameters enabling flexible accuracy-efficiency trade-offs.
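The FIAU's pointer-based alignment can be sketched in software (a hedged model: `fifo_align` and its interface are illustrative, not the paper's RTL). Rather than routing every mantissa through a barrel shifter, each element gets a pointer equal to its exponent's distance from the block maximum, and the aligned mantissas then share one exponent for the INT MAC array.

```python
# Hedged model of pointer-based input alignment (the FIAU idea): instead of
# a per-element barrel shifter, each mantissa is read out at an offset given
# by its exponent distance from the block maximum. Illustrative only.

def fifo_align(mants, exps, width=12):
    """Align integer mantissas to the block's maximum exponent.
    Returns the aligned mantissas and the shared exponent."""
    e_max = max(exps)
    aligned = []
    for m, e in zip(mants, exps):
        ptr = e_max - e                    # pointer = required right shift
        aligned.append(0 if ptr >= width else m >> ptr)
    return aligned, e_max

# 0b101 with exponent 1 lands two positions below 0b111 with exponent 3:
print(fifo_align([0b111, 0b101], [3, 1], width=4))  # -> ([7, 1], 3)
```

In this view the `width` parameter is exactly the aligned-mantissa bitwidth that the DSBP selects, so narrower predictions shorten the alignment window and the downstream integer accumulation.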
Problem

Research questions and friction points this paper is trying to address.

FP8
compute-in-memory
bitwidth adaptation
precision efficiency trade-off
aligned-mantissa
Innovation

Methods, ideas, or system contributions that make the work stand out.

FP8
Compute-in-Memory
Dynamic Bitwidth Prediction
Input Alignment
Precision-Scalable MAC
Liang Zhao
South China University of Technology, Guangzhou, China
Kunming Shao
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Zhipeng Liao
Professor, Department of Economics, UCLA
Xijie Huang
Hong Kong University of Science and Technology
Tim Kwang-Ting Cheng
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Chi-Ying Tsui
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Yi Zou
Intel Labs