AI Summary
This work addresses the inefficiency of existing digital compute-in-memory (DCIM) architectures in supporting diverse FP8 formats due to their fixed-precision MAC units and uniform alignment strategies. To overcome this limitation, the authors propose a flexible FP8 DCIM accelerator that dynamically predicts input distributions and adaptively adjusts the alignment precision of weights and inputs, achieving significant energy efficiency gains without compromising model accuracy. Key innovations include Dynamic Shift-aware Bitwidth Prediction (DSBP), a FIFO-pointer-based Input Alignment Unit (FIAU), and a precision-scalable integer MAC array. Implemented in 28nm CMOS, the 64×96 CIM chip achieves 20.4 TFLOPS/W under the E5M7 format (2.8× higher than prior art) and demonstrates superior efficiency at iso-accuracy on Llama-7B across the BoolQ and Winogrande benchmarks, enabling flexible trade-offs between precision and energy efficiency.
Abstract
FP8 low-precision formats have gained significant adoption in Transformer inference and training. However, existing digital compute-in-memory (DCIM) architectures face challenges in supporting variable FP8 aligned-mantissa bitwidths, as unified alignment strategies and fixed-precision multiply-accumulate (MAC) units struggle to handle input data with diverse distributions. This work presents a flexible FP8 DCIM accelerator with three innovations: (1) a dynamic shift-aware bitwidth prediction (DSBP) scheme with on-the-fly input prediction that adaptively adjusts weight (2/4/6/8b) and input (2$\sim$12b) aligned-mantissa precision; (2) a FIFO-based input alignment unit (FIAU) replacing complex barrel shifters with pointer-based control; and (3) a precision-scalable INT MAC array achieving flexible weight precision with minimal overhead. Implemented in 28nm CMOS with a 64$\times$96 CIM array, the design achieves 20.4 TFLOPS/W for fixed E5M7, demonstrating 2.8$\times$ higher FP8 efficiency than previous work while supporting all FP8 formats. Results on Llama-7B show that the DSBP achieves higher efficiency than the fixed-bitwidth mode at the same accuracy level on both the BoolQ and Winogrande datasets, with configurable parameters enabling flexible accuracy-efficiency trade-offs.
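As background on the mechanism the abstract describes, namely aligning FP8 mantissas to a shared exponent so that products can be accumulated in an integer MAC, with the aligned-mantissa bitwidth as a tunable knob, here is a minimal Python sketch. It illustrates the general technique, not the paper's hardware: the function names, the E4M3 default layout, and the truncation model are assumptions made for illustration.

```python
# Minimal sketch of FP8 mantissa alignment feeding an integer MAC.
# Illustrative only; not the paper's RTL or its exact datapath.

def decode_fp8(bits, exp_bits=4, man_bits=3):
    """Split an FP8 word (default E4M3-style layout) into sign,
    effective exponent, and mantissa with the hidden leading 1."""
    sign = (bits >> (exp_bits + man_bits)) & 1
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp != 0:
        man |= 1 << man_bits          # normals carry a hidden leading 1
    return sign, max(exp, 1), man     # subnormals use effective exponent 1

def aligned_dot(x_words, w_words, align_bits=12, exp_bits=4, man_bits=3):
    """Dot product: right-shift each input mantissa to the block-max
    exponent, keep only `align_bits` bits (the lossy step), then
    multiply-accumulate in plain integers."""
    bias = (1 << (exp_bits - 1)) - 1
    xs = [decode_fp8(x, exp_bits, man_bits) for x in x_words]
    ws = [decode_fp8(w, exp_bits, man_bits) for w in w_words]
    e_max = max(e for _, e, _ in xs)
    head = align_bits - (man_bits + 1)    # headroom before bits fall off
    acc = 0
    for (xsgn, xe, xm), (wsgn, we, wm) in zip(xs, ws):
        # Align to e_max; truncation here models the accuracy cost
        # of choosing a small aligned-mantissa bitwidth.
        m = (xm << head) >> (e_max - xe)
        prod = m * wm                     # integer multiply
        if xsgn ^ wsgn:
            prod = -prod
        acc += prod * (1 << we)           # fold in the weight exponent
    # Undo the fixed-point scaling to recover a real-valued result.
    return acc * 2.0 ** (e_max - head - 2 * bias - 2 * man_bits)
```

With `align_bits` large enough to absorb the worst-case exponent gap the result is exact; shrinking it toward the low end of the 2$\sim$12b input range narrows the integer MAC at the cost of truncation error, which is the kind of accuracy-efficiency trade-off the abstract describes DSBP tuning at run time per input distribution.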