🤖 AI Summary
Existing computing platforms lack native support for 1.58-bit ternary weights and instead rely on inefficient dequantization, while the design space of lookup table (LUT)-based accelerators remains underexplored and lacks systematic, fair evaluation. This work formalizes the design space of LUT-based accelerators for 1.58-bit large language model inference and introduces an open-source hardware generator together with an analytical cost model, validated against synthesis in TSMC 16nm technology, to enable rapid architectural exploration and fair cross-design comparison. The study finds that the activation data type largely governs the optimal architecture: LUT-based reuse offers significant gains for high-cost arithmetic such as FP16 but diminishing returns for small integer types, and maximizing core size improves area density over highly tiled approaches, which challenges several assumptions in recent literature. The optimized designs achieve a 2.2× area reduction compared to multiplier-based baselines; separately, correcting suboptimal parameters in state-of-the-art implementations yields up to a 1.2× area improvement.
📝 Abstract
Ternary weight quantization (e.g., BitNet b1.58) offers a promising path to mitigate the memory bandwidth bottleneck in Large Language Model (LLM) inference. However, conventional compute platforms lack native support for ternary-weight arithmetic, often relying on inefficient dequantization. Lookup table (LUT)-based hardware architectures provide an effective alternative by replacing multiplications with conditional additions, but their design space remains largely unexplored. Existing designs rely on heuristic parameter selection, lacking a systematic understanding of the architectural trade-offs. This work addresses this gap by formalizing the design space of ternary LUT-based accelerators and presenting an open-source hardware generator coupled with an analytical cost model, validated against synthesis in TSMC 16nm technology. By spanning the full architectural space, this framework not only enables rapid design space exploration but also establishes a common footing for fair cross-design evaluation, which was previously hindered by inconsistent instantiations across published accelerators. Using this framework, we challenge several assumptions and design choices in recent literature. We demonstrate that the optimal architecture is fundamentally governed by the activation data type: while LUT-based reuse offers significant gains for high-cost arithmetic (e.g., FP16), it yields diminishing returns for small integer types. Furthermore, we show that maximizing core size consistently improves area density compared to highly tiled approaches. Our optimized designs achieve a 2.2x area reduction compared to multiplier-based baselines. Moreover, by benchmarking state-of-the-art implementations against our model, we reveal that correcting suboptimal parameters yields up to a 1.2x area improvement.
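To make the LUT idea in the abstract concrete, the following is a minimal software sketch, not the paper's hardware generator or cost model. It assumes ternary weights in {-1, 0, +1}, integer activations, and a hypothetical group size `group`; the function names `signed_sum`, `build_lut`, and `lut_matvec` are illustrative only. For each group of activations, every possible signed sum is precomputed once using conditional additions, and each weight row then replaces its multiplications with a single table lookup.

```python
from itertools import product


def signed_sum(pattern, acts):
    """Combine activations using only conditional add/subtract (no multiplies)."""
    total = 0
    for w, a in zip(pattern, acts):
        if w == 1:
            total += a
        elif w == -1:
            total -= a
        # w == 0 contributes nothing
    return total


def build_lut(acts):
    """Precompute the signed sum for every ternary pattern over one activation group."""
    return {
        pattern: signed_sum(pattern, acts)
        for pattern in product((-1, 0, 1), repeat=len(acts))
    }


def lut_matvec(weights, acts, group=2):
    """Ternary matrix-vector product where each table is shared by all weight rows."""
    n = len(acts)
    assert n % group == 0, "sketch assumes the vector length is a multiple of group"
    out = [0] * len(weights)
    for start in range(0, n, group):
        # One table per activation group, built once and reused by every row.
        lut = build_lut(tuple(acts[start:start + group]))
        for r, row in enumerate(weights):
            out[r] += lut[tuple(row[start:start + group])]
    return out


weights = [
    [1, 0, -1, 1],
    [-1, -1, 0, 1],
    [0, 0, 1, 1],
]
acts = [3, -2, 5, 7]
print(lut_matvec(weights, acts))  # [5, 6, 12]
```

In this sketch a table over a group of g activations holds 3^g entries, so the cost of building it is only amortized when many weight rows reuse it. That is one plausible way to read the abstract's observation that table reuse pays off more when each addition is expensive (e.g., FP16) than when it is cheap (small integers).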