Pushing the Limits of BFP on Narrow Precision LLM Inference

πŸ“… 2025-01-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Nonlinear operators (e.g., Softmax, Attention) in large language model (LLM) inference suffer from quadratic computational complexity and inefficient floating-point arithmetic, forming a critical performance bottleneck. Method: This paper presents DB-Attn, the first hardware-software co-design framework enabling block floating-point (BFP) arithmetic for nonlinear computations. It introduces Dynamic Block Floating-Point (DBFP), which integrates a pivot-focus representation with adaptive grouping, alongside a DH-LUT lookup-table acceleration algorithm and an RTL-level dedicated compute engine deployable on FPGA or ASIC. Contribution/Results: DB-Attn breaks the conventional limitation of BFP to linear operations, enabling narrow-precision, high-efficiency nonlinear execution. Evaluated on LLaMA's Softmax, it achieves a 74% speedup over GPU baselines, reduces hardware overhead by 10× versus state-of-the-art accelerators, and incurs negligible accuracy loss.
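The summary above rests on block floating point: a group of values shares a single exponent, so only narrow integer mantissas need to be stored and multiplied. A minimal Python sketch of plain per-block BFP quantization may make the idea concrete; note this is classic BFP, not the paper's DBFP (the pivot-focus and adaptive-grouping strategies are not reproduced here), and `mantissa_bits` and `block_size` are illustrative parameters:

```python
import math

def bfp_quantize(values, mantissa_bits=4, block_size=4):
    """Quantize a list of floats to block floating point (BFP).

    Each block of `block_size` values shares one exponent derived from the
    block's largest magnitude; each value's mantissa is then rounded to
    `mantissa_bits` bits of that shared scale. A toy illustration of
    classic BFP, not the paper's DBFP.
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_mag = max(abs(v) for v in block)
        if max_mag == 0.0:
            out.extend([0.0] * len(block))
            continue
        shared_exp = math.ceil(math.log2(max_mag))   # one exponent per block
        scale = 2.0 ** (shared_exp - mantissa_bits)  # value of one mantissa LSB
        out.extend(round(v / scale) * scale for v in block)
    return out

x = [0.11, -0.52, 0.98, 0.03, 7.5, -3.2, 1.1, 0.4]
print(bfp_quantize(x, mantissa_bits=4, block_size=4))
```

Small-magnitude values grouped with a large outlier lose precision to the shared exponent, which is exactly the weakness that motivates smarter grouping schemes such as the paper's adaptive grouping.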

πŸ“ Abstract
The substantial computational and memory demands of Large Language Models (LLMs) hinder their deployment. Block Floating Point (BFP) has proven effective in accelerating linear operations, a cornerstone of LLM workloads. However, as sequence lengths grow, nonlinear operations, such as Attention, increasingly become performance bottlenecks due to their quadratic computational complexity. These nonlinear operations are predominantly executed in inefficient floating-point formats, which makes it challenging to optimize both software efficiency and hardware overhead. In this paper, we delve into the limitations and potential of applying BFP to nonlinear operations. Given our findings, we introduce a hardware-software co-design framework (DB-Attn), including: (i) DBFP, an advanced BFP variant that overcomes nonlinear-operation challenges with a pivot-focus strategy for diverse data and an adaptive grouping strategy for flexible exponent sharing; (ii) DH-LUT, a novel lookup-table algorithm dedicated to accelerating nonlinear operations in the DBFP format; (iii) an RTL-level DBFP-based engine, implemented to support DB-Attn and applicable to both FPGA and ASIC. Results show that DB-Attn provides significant performance improvements with negligible accuracy loss, achieving a 74% GPU speedup on LLaMA's Softmax and a 10× lower-overhead performance improvement over SOTA designs.
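The abstract's DH-LUT idea, replacing expensive transcendental evaluation in Softmax with table lookups, can be illustrated with a generic lookup-table exp approximation. This is a hedged sketch only: the paper's DH-LUT derives its table index from the DBFP exponent/mantissa representation, whereas this toy version uses nearest-entry indexing over uniformly spaced points, and `num_entries`, `x_min`, and `x_max` are illustrative parameters:

```python
import math

def build_exp_lut(num_entries=64, x_min=-8.0, x_max=0.0):
    """Precompute exp() at evenly spaced points on [x_min, x_max].

    Softmax inputs are shifted by their maximum, so the arguments to exp()
    always fall in (-inf, 0]; clamping to x_min bounds the table size.
    """
    step = (x_max - x_min) / (num_entries - 1)
    lut = [math.exp(x_min + i * step) for i in range(num_entries)]
    return lut, step, x_min

def lut_softmax(xs, lut, step, x_min):
    """Approximate softmax using nearest-entry table reads instead of exp()."""
    m = max(xs)

    def approx_exp(x):
        shifted = max(x - m, x_min)            # clamp into table range
        idx = round((shifted - x_min) / step)  # nearest table index
        return lut[idx]

    e = [approx_exp(x) for x in xs]
    s = sum(e)
    return [v / s for v in e]

lut, step, x_min = build_exp_lut()
print(lut_softmax([1.0, 2.0, 3.0, 0.5], lut, step, x_min))
```

A small table already tracks the exact softmax closely because normalization cancels much of the per-entry rounding error; hardware designs trade table size against the precision target in the same way.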
Problem

Research questions and friction points this paper is trying to address.

Optimize nonlinear operations in LLMs
Enhance BFP for complex computations
Develop efficient hardware-software co-design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advanced BFP for nonlinear operations
Novel lookup table algorithm
RTL-level DBFP-based engine
Hui Wang
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University
Yuan Cheng
Houmo AI, Nanjing University
Xiaomeng Han
Southeast University
Zhengpeng Zhao
Huazhong University of Science and Technology
Dawei Yang
Houmo AI
Zhe Jiang
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University