Platinum: Path-Adaptable LUT-Based Accelerator Tailored for Low-Bit Weight Matrix Multiplication

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high lookup-table (LUT) construction overhead and the inefficiency of bit-serial computation in ultra-low-bit neural network inference, this paper proposes Platinum, a path-adaptive LUT acceleration architecture. Platinum reduces hardware overhead by generating LUT construction paths offline, and supports dynamic switching between a generic bit-serial mode and an optimized ternary-weight mode, enabling mixed-precision matrix multiplication (mpGEMM). It presents the first ASIC implementation of a configurable dual-mode compute unit, deeply optimizing for ternary networks without sacrificing generality. Evaluated on the BitNet b1.58-3B model, the design achieves speedups of up to 73.6×, 4.09×, and 2.15× over SpikingEyeriss, Prosperity, and a 16-thread T-MAC CPU baseline, respectively, along with energy reductions of 32.4×, 3.23×, and 20.9×, all within 0.96 mm² of silicon area.

📝 Abstract
The rapid scaling of large language models demands more efficient hardware. Quantization offers a promising trade-off between efficiency and performance. Ultra-low-bit quantization creates abundant opportunities for result reuse, so it can be accelerated with lookup-table (LUT) based methods. However, existing LUT-based methods suffer from computation and hardware overheads for LUT construction, and rely solely on bit-serial computation, which is suboptimal for ternary-weight networks. We propose Platinum, a lightweight ASIC accelerator for integer-weight mixed-precision matrix multiplication (mpGEMM) using LUTs. Platinum reduces LUT construction overhead via offline-generated construction paths and supports both general bit-serial and optimized ternary-weight execution through adaptive path switching. On BitNet b1.58-3B, Platinum achieves up to 73.6×, 4.09×, and 2.15× speedups over SpikingEyeriss, Prosperity, and 16-thread T-MAC (CPU), respectively, along with energy reductions of 32.4×, 3.23×, and 20.9×, all within a 0.96 mm² chip area. This demonstrates the potential of LUT-based ASICs as efficient, scalable solutions for ultra-low-bit neural networks on edge platforms.
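As a concrete picture of the bit-serial LUT technique the abstract builds on (the style of kernel T-MAC popularized), here is a minimal NumPy sketch: partial sums of each activation group are precomputed for every g-bit pattern, and weight bit-planes then index the table instead of multiplying. The function name, the group size g, and the 2-bit weight width are illustrative assumptions, not details of Platinum's hardware.

```python
import numpy as np

def lut_mpgemm_bitserial(acts, weights, w_bits=2, g=4):
    """Hypothetical sketch of bit-serial LUT-based mpGEMM.

    acts:    (K,) integer activations
    weights: (N, K) signed integers representable in w_bits bits
    Each LUT entry holds the partial sum of one g-bit pattern with a
    group of g activations (2**g entries per group), so processing a
    weight bit-plane becomes one table lookup per group.
    """
    K = acts.shape[0]
    assert K % g == 0
    n_groups = K // g
    # "Offline"-style LUT construction: partial sums for all 2**g patterns.
    patterns = (np.arange(2**g)[:, None] >> np.arange(g)) & 1   # (2**g, g)
    luts = patterns @ acts.reshape(n_groups, g).T               # (2**g, n_groups)

    out = np.zeros(weights.shape[0], dtype=np.int64)
    for n in range(weights.shape[0]):
        acc = 0
        for b in range(w_bits):
            bit_plane = (weights[n] >> b) & 1                   # (K,)
            idx = bit_plane.reshape(n_groups, g) @ (1 << np.arange(g))
            plane_sum = luts[idx, np.arange(n_groups)].sum()
            # In two's complement the MSB plane carries negative weight.
            sign = -1 if b == w_bits - 1 else 1
            acc += sign * (plane_sum << b)
        out[n] = acc
    return out
```

Note that the inner loop runs once per weight bit even when weights are ternary, which is the inefficiency the paper's dedicated ternary mode targets.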
Problem

Research questions and friction points this paper is trying to address.

Accelerates low-bit weight matrix multiplication with LUTs
Reduces LUT construction overhead via offline paths
Supports bit-serial and ternary-weight execution adaptively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Offline-generated paths reduce LUT construction overhead
Adaptive path switching supports bit-serial and ternary execution
LUT-based ASIC accelerator for low-bit matrix multiplication
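The ternary execution mode mentioned above can be illustrated generically: with weights restricted to {-1, 0, +1}, each group of g weights forms a base-3 index into a 3**g-entry table of precomputed activation sums, so a single lookup replaces g multiply-accumulates and no bit-plane loop is needed. This is a hypothetical NumPy sketch of that idea, not Platinum's actual datapath or path-switching logic.

```python
import numpy as np
from itertools import product

def lut_gemv_ternary(acts, weights, g=4):
    """Hypothetical ternary-weight LUT GEMV sketch.

    acts:    (K,) integer activations
    weights: (N, K) ternary weights in {-1, 0, +1}
    One base-3 lookup per group of g weights replaces g MACs.
    """
    K = acts.shape[0]
    assert K % g == 0
    n_groups = K // g
    # All 3**g ternary patterns; itertools.product varies the last slot fastest.
    patterns = np.array(list(product([-1, 0, 1], repeat=g)))    # (3**g, g)
    luts = patterns @ acts.reshape(n_groups, g).T               # (3**g, n_groups)
    # Encode each weight group as a base-3 index (first digit most significant).
    place = 3 ** np.arange(g - 1, -1, -1)
    idx = (weights + 1).reshape(-1, n_groups, g) @ place        # (N, n_groups)
    return luts[idx, np.arange(n_groups)].sum(axis=1)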
Haoxuan Shan
Department of Electrical and Computer Engineering, Duke University
Cong Guo
Department of Electrical and Computer Engineering, Duke University
Chiyue Wei
Ph.D. student at ECE, Duke University
Computer Architecture · Deep Learning
Feng Cheng
Department of Electrical and Computer Engineering, Duke University
Junyao Zhang
Department of Electrical and Computer Engineering, Duke University
Hai Li
Department of Electrical and Computer Engineering, Duke University
Yiran Chen
Department of Electrical and Computer Engineering, Duke University