LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator

📅 2025-01-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the severe accuracy degradation and hardware implementation challenges of ultra-low-bit quantization (scalar quantization cannot represent fewer than 1 bit per value), this paper proposes a hardware/software co-designed acceleration framework based on vector quantization and lookup table (LUT) mapping. It introduces LUTBoost, a multi-stage training algorithm that jointly optimizes model architecture (across CNNs and Transformers) and hardware parameters in a LUT-driven manner, enabling efficient conversion of DNNs into LUT-native models. The approach preserves model accuracy while achieving 1.4–7.0× improvement in power efficiency and 1.5–146.1× in area efficiency over state-of-the-art implementations. Accuracy loss remains bounded at 0.1–3.8% for CNNs and 1.4–3.0% for Transformers, significantly outperforming existing ultra-low-bit quantization methods.

📝 Abstract
The emergence of neural network capabilities invariably leads to a significant surge in computational demands due to expanding model sizes and increased computational complexity. To reduce model size and lower inference costs, recent research has focused on simplifying models and designing hardware accelerators using low-bit quantization. However, due to numerical representation limits, scalar quantization cannot reduce bit width lower than 1-bit, diminishing its benefits. To break through these limitations, we introduce LUT-DLA, a Look-Up Table (LUT) Deep Learning Accelerator Framework that utilizes vector quantization to convert neural network models into LUTs, achieving extreme low-bit quantization. The LUT-DLA framework facilitates efficient and cost-effective hardware accelerator designs and supports the LUTBoost algorithm, which helps to transform various DNN models into LUT-based models via multistage training, drastically cutting both computational and hardware overhead. Additionally, through co-design space exploration, LUT-DLA assesses the impact of various model and hardware parameters to fine-tune hardware configurations for different application scenarios, optimizing performance and efficiency. Our comprehensive experiments show that LUT-DLA achieves improvements in power efficiency and area efficiency with gains of 1.4–7.0× and 1.5–146.1×, respectively, while maintaining only a modest accuracy drop. For CNNs, accuracy decreases by 0.1%–3.1% using the L2 distance similarity, 0.1%–3.4% with the L1 distance similarity, and 0.1%–3.8% when employing the Chebyshev distance similarity. For transformer-based models, the accuracy drop ranges from 1.4% to 3.0%.
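The core idea the abstract describes, replacing multiply-accumulate arithmetic with table lookups via vector quantization, can be illustrated with a minimal product-quantization-style sketch. This is not the paper's actual LUTBoost algorithm or hardware design; all names, codebook sizes, and the nearest-centroid assignment below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (not LUT-DLA's actual design): weight rows are split
# into sub-vectors, each sub-vector is replaced by the index of its nearest
# codebook centroid, and at inference time dot products are served from a
# precomputed lookup table instead of multiplications.

rng = np.random.default_rng(0)

D, V, K = 8, 2, 16                 # feature dim, sub-vector length, codebook size
N = 4                              # number of output neurons (toy layer)
W = rng.standard_normal((N, D))    # dense weight matrix to be "LUT-ified"

n_groups = D // V
# 1) A codebook of K centroids per sub-vector group (here: random stand-ins
#    for centroids that training would normally learn).
codebooks = rng.standard_normal((n_groups, K, V))

# 2) Quantize each weight sub-vector to its nearest centroid index (L2),
#    so each sub-vector is stored as a log2(K)-bit code.
W_sub = W.reshape(N, n_groups, V)
dists = np.linalg.norm(W_sub[:, :, None, :] - codebooks[None], axis=-1)
codes = dists.argmin(axis=-1)      # shape (N, n_groups)

def lut_matvec(x):
    """Approximate W @ x using only table lookups and additions."""
    x_sub = x.reshape(n_groups, V)
    # 3) Precompute <centroid, input sub-vector> for every centroid once...
    lut = np.einsum('gkv,gv->gk', codebooks, x_sub)     # (n_groups, K)
    # 4) ...then each output is a sum of LUT entries, no per-weight MACs.
    return lut[np.arange(n_groups), codes].sum(axis=-1)

x = rng.standard_normal(D)
approx = lut_matvec(x)             # LUT-based result, shape (N,)
```

The LUT output matches a dense matmul against the *quantized* weights exactly; the gap to the original `W @ x` is the vector-quantization error that the paper's multi-stage training is designed to keep small.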
Problem

Research questions and friction points this paper is trying to address.

Low-precision Computing
Deep Learning Models
Hardware Cost Reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

LUT-DLA
low-bit operations
hardware accelerator
👥 Authors
Guoyu Li — University of Chinese Academy of Sciences
Shengyu Ye — Microsoft Research
Chunyun Chen — NTU Singapore
Yang Wang — Microsoft Research
Fan Yang — Microsoft Research
Ting Cao — Microsoft Research
Cheng Liu — University of Chinese Academy of Sciences
Mohamed M. Sabry Aly — Associate Professor, Nanyang Technological University
Mao Yang — Microsoft Research