AI Summary
To address area and energy-efficiency bottlenecks in multi-format (INT/FP/MX) multiply-accumulate (MAC) operations for AI accelerators, this paper proposes Jack Unit, a highly integrated, precision-configurable MAC architecture. It unifies two key innovations: (1) the first hardware integration of a precision-scalable carry-save multiplier (CSM) with an exponent-difference-driven dynamic mantissa alignment mechanism; and (2) a two-dimensional subword parallelism scheme that enables computational resources to be reused across formats. In a custom 40 nm layout, Jack Unit achieves a 1.17–2.01× area reduction and a 1.05–1.84× power reduction versus baseline MAC units. When integrated into an AI accelerator, Jack Unit improves energy efficiency by 1.32–5.41× across five benchmark workloads, improving hardware utilization and format adaptability.
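To make the idea of a precision-scalable multiplier concrete, the following is a minimal functional sketch in Python, not the paper's circuit. It assumes unsigned operands and a hypothetical 4-bit building block `mul4`; the names `mul4`, `mul8_from_4bit`, and `four_mul4` are illustrative and do not come from the paper. The same four sub-multipliers either combine into one 8-bit product or serve four independent 4-bit products, which is the kind of resource reuse the summary describes.

```python
from typing import List, Tuple


def mul4(a: int, b: int) -> int:
    """Hypothetical 4-bit x 4-bit unsigned sub-multiplier (illustrative only)."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b


def mul8_from_4bit(a: int, b: int) -> int:
    """Compose one 8-bit x 8-bit unsigned product from four 4-bit products."""
    assert 0 <= a < 256 and 0 <= b < 256
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # Each partial product is weighted by the bit position of its operands.
    return (
        (mul4(a_hi, b_hi) << 8)
        + ((mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4)
        + mul4(a_lo, b_lo)
    )


def four_mul4(pairs: List[Tuple[int, int]]) -> List[int]:
    """Low-precision mode: the same four sub-multipliers yield four 4-bit products."""
    assert len(pairs) == 4
    return [mul4(a, b) for a, b in pairs]
```

In hardware, the partial products would be summed in carry-save form rather than with full additions; the sketch only shows the arithmetic decomposition.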
Abstract
In this work, we introduce an area- and energy-efficient multiply-accumulate (MAC) unit, named Jack unit, that is a jack-of-all-trades, supporting various data formats such as integer (INT), floating point (FP), and microscaling data format (MX). It provides bit-level flexibility and enhances hardware efficiency by i) replacing the carry-save multiplier (CSM) in the FP multiplier with a precision-scalable CSM, ii) performing the adjustment of significands based on the exponent differences within the CSM, and iii) utilizing 2D sub-word parallelism. To assess effectiveness, we implemented the layout of the Jack unit and three baseline MAC units. Additionally, we designed an AI accelerator equipped with our Jack units to compare with a state-of-the-art AI accelerator supporting various data formats. The proposed MAC unit occupies 1.17~2.01x smaller area and consumes 1.05~1.84x lower power compared to the baseline MAC units. On five AI benchmarks, the accelerator designed with our Jack units improves energy efficiency by 1.32~5.41x over the baseline across various data formats.
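Step ii), adjusting significands according to exponent differences, can be illustrated with a small functional model. This is a sketch under stated assumptions, not the Jack unit's datapath: operands are modeled as (signed integer mantissa, exponent) pairs with value mantissa × 2^exponent, and the function name `aligned_dot` and the `guard_bits` parameter are hypothetical. Each product is shifted right by its distance from the largest product exponent, so that all terms can be accumulated as plain integers.

```python
from typing import List, Tuple

Operand = Tuple[int, int]  # (signed integer mantissa, exponent); value = mantissa * 2**exponent


def aligned_dot(terms: List[Tuple[Operand, Operand]], guard_bits: int = 8) -> Tuple[int, int]:
    """Dot product with exponent-difference alignment (illustrative model).

    Returns (acc, exp) such that the result is approximately acc * 2**exp.
    Bits shifted out beyond the guard bits are dropped (floor rounding).
    """
    if not terms:
        return 0, 0
    # Multiply mantissas and add exponents for every operand pair.
    products = [(ma * mb, ea + eb) for (ma, ea), (mb, eb) in terms]
    max_exp = max(e for _, e in products)
    acc = 0
    for mant, exp in products:
        shift = max_exp - exp  # exponent difference relative to the largest term
        acc += (mant << guard_bits) >> shift  # align to max_exp, then accumulate
    return acc, max_exp - guard_bits


def to_float(acc: int, exp: int) -> float:
    """Convert the (acc, exp) pair back to a Python float for inspection."""
    return acc * 2.0 ** exp
```

For example, `aligned_dot([((3, 0), (5, 1)), ((-2, 2), (7, -1)), ((1, -3), (1, 0))])` accumulates 30 − 28 + 0.125. The abstract states that the Jack unit performs this adjustment inside the CSM; the model above applies it as a separate step purely for readability.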