FlexiBit: Fully Flexible Precision Bit-parallel Accelerator Architecture for Arbitrary Mixed Precision AI

๐Ÿ“… 2024-11-27
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 1
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Current hardware supports only a fixed set of power-of-two floating-point precisions (e.g., FP8/FP16), limiting adaptability to the mixed-precision requirements of large language models. This restricts computational efficiency for sub-8-bit formats (e.g., FP5/FP6) and necessitates costly hardware redesign for each new precision. To address this, we propose a fully flexible bit-parallel architecture—the first to enable efficient computation across arbitrary non-power-of-two floating-point and integer precisions and formats, achieving complete decoupling among precision, data format, and compute units. Our design comprises three key components: a reconfigurable bit-parallel array, a dynamic precision routing network, and a unified floating-point/integer decoder. Evaluated on GPT-3 FP6 inference, our architecture delivers 1.66× and 1.62× higher performance-per-area than NVIDIA Tensor Cores and BitFusion, respectively, and outperforms state-of-the-art bit-serial architectures by 3.9×.

๐Ÿ“ Abstract
Recent research has shown that large language models (LLMs) can utilize low-precision floating point (FP) quantization to deliver high efficiency while maintaining original model accuracy. In particular, recent works have shown the effectiveness of non-power-of-two precisions, such as FP6 and FP5, and the diverse sensitivity of LLM layers to low-precision arithmetic, which motivates mixed-precision arithmetic including non-power-of-two precisions in LLMs. Although low precision algorithmically reduces computational overhead, such benefits cannot be fully exploited due to hardware constraints that support a limited set of power-of-two precisions (e.g., FP8, 16, 32, and 64 in NVIDIA H100 Tensor Core). In addition, the hardware compute units are designed to support standard formats (e.g., E4M3 and E5M2 for FP8). Such practices require re-designing the hardware whenever new precisions and formats emerge, which leads to high hardware replacement costs to exploit the benefits of new precisions and formats. Therefore, in this paper, we propose a new accelerator architecture, FlexiBit, which efficiently supports FP and INT arithmetic in arbitrary precisions and formats. Unlike previous bit-serial designs, which also provide flexibility but at the cost of performance due to their bit-wise temporal processing nature, FlexiBit's architecture enables bit-parallel processing of any precision and format without compute unit underutilization. FlexiBit's new capability to exploit non-power-of-two precisions and formats led to 1.66x and 1.62x higher performance per area on GPT-3 in FP6 targeting a cloud-scale accelerator, compared to a Tensor Core-like architecture and a state-of-the-art bit-parallel flexible precision accelerator, BitFusion, respectively. Also, the bit-parallel nature of FlexiBit's architecture led to 3.9x higher performance/area compared to a state-of-the-art bit-serial architecture.
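To make the "arbitrary precision and format" idea concrete: any ExMy floating-point format (e.g., E4M3 or E5M2 for FP8, or an E3M2 variant of FP6) can be decoded from a bit pattern by splitting out sign, exponent, and mantissa fields and applying an IEEE-754-style bias. The sketch below is our own illustration of that generic decoding, not FlexiBit's hardware decoder; NaN/Inf special encodings are omitted for brevity.

```python
def decode_custom_float(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode an integer bit pattern as a sign/exponent/mantissa float.

    A format name like E4M3 means exp_bits=4, man_bits=3 (plus 1 sign bit).
    NaN/Inf special cases are intentionally omitted (illustrative only).
    """
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1  # IEEE-style bias, e.g. 7 for E4M3
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of (1 - bias)
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# E4M3 (FP8): 0b0_0111_100 encodes 1.5
print(decode_custom_float(0b00111100, exp_bits=4, man_bits=3))  # 1.5
# E3M2 (an FP6 variant): 0b0_011_00 encodes 1.0
print(decode_custom_float(0b001100, exp_bits=3, man_bits=2))    # 1.0
```

Because the field widths are parameters rather than hardwired constants, the same logic covers FP8, FP6, FP5, or any other ExMy split, which is the flexibility the paper's unified decoder provides in hardware.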
Problem

Research questions and friction points this paper is trying to address.

How to support arbitrary mixed-precision AI arithmetic efficiently
Hardware constraints limit support for non-power-of-two precisions
Flexible-precision processing suffers from compute unit underutilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

FlexiBit enables arbitrary precision and format arithmetic
Bit-parallel processing avoids compute unit underutilization
Supports non-power-of-two precisions for higher efficiency
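A rough way to see where the bit-parallel gain on non-power-of-two formats comes from: hardware restricted to power-of-two lanes must pad each FP6 operand up to an 8-bit lane, while a format-agnostic bit-parallel datapath can pack operands at their native width. The back-of-the-envelope sketch below is our own illustration of that packing-density argument, not FlexiBit's actual mapping scheme.

```python
def operands_per_word(word_bits: int, operand_bits: int, pad_to: int = 0) -> int:
    """How many operands fit in a fixed-width datapath word.

    If pad_to > operand_bits, each operand is padded up to that lane
    width (as on power-of-two-only hardware); otherwise operands are
    packed at their native width (illustrative bit-parallel packing).
    """
    lane = max(operand_bits, pad_to)
    return word_bits // lane

# FP6 operands in a 64-bit word:
print(operands_per_word(64, 6, pad_to=8))  # 8  (padded to FP8 lanes)
print(operands_per_word(64, 6))            # 10 (native-width packing)
```

Native-width packing fits 10 FP6 operands per 64-bit word versus 8 when padded to FP8 lanes, a 1.25x density gain; avoiding the analogous padding and underutilization in the compute array is the core of the bit-parallel advantage the bullets above describe.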
๐Ÿ”Ž Similar Papers
No similar papers found.
Faraz Tahmasebi
Ph.D. Student, University of California, Irvine
ML Accelerators, Large Language Models, ASIC Design, Network-on-Chip
Yian Wang
University of California, Irvine, Electrical Engineering and Computer Science, Irvine, CA, USA
Benji Y.H. Huang
University of California, Irvine, Electrical Engineering and Computer Science, Irvine, CA, USA
Hyoukjun Kwon
Assistant Professor, University of California, Irvine
Computer Architecture, Deep Learning Accelerator, Network-on-Chip (NoC), Machine Learning, Deep Neural Network