LOCALUT: Harnessing Capacity-Computation Tradeoffs for LUT-Based Inference in DRAM-PIM

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of limited logic resources and inefficient low-bitwidth DNN inference in DRAM-based processing-in-memory (PIM) architectures by proposing a lookup table (LUT)-based operation packing method. The approach encodes multiple multiply-accumulate (MAC) operations into arithmetic-free LUTs, trading memory capacity for increased computational throughput. Key innovations include LUT normalization to eliminate redundancy, lightweight weight remapping via LUT reordering, and a streaming LUT slicing strategy that enhances memory efficiency and computational reuse. Evaluated on a real UPMEM PIM system across various numerical precisions and DNN models, the proposed method achieves a 1.82× geometric mean speedup over baseline approaches.
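The operation-packing idea summarized above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: it assumes 2-bit activations, packs two MAC operations per lookup, and uses helper names (`build_packed_lut`, `packed_mac`) invented here for clarity.

```python
# Sketch: pack two low-bit MACs into one arithmetic-free lookup.
# For a fixed weight pair (w0, w1), precompute a0*w0 + a1*w1 for
# every possible 2-bit activation pair (a0, a1).

BITS = 2
LEVELS = 1 << BITS  # 4 activation levels per operand

def build_packed_lut(w0, w1):
    # Index = (a0 << BITS) | a1; entry = the packed MAC result.
    return [a0 * w0 + a1 * w1
            for a0 in range(LEVELS)
            for a1 in range(LEVELS)]

def packed_mac(lut, a0, a1):
    # One table lookup replaces two multiplies and an add.
    return lut[(a0 << BITS) | a1]

lut = build_packed_lut(3, -2)
assert packed_mac(lut, 1, 2) == 1 * 3 + 2 * (-2)
```

Packing more MACs or wider operands grows the table exponentially, which is exactly the memory-capacity-for-throughput tradeoff the paper exploits.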
📝 Abstract
Lookup tables (LUTs) have recently gained attention as an alternative compute mechanism that maps input operands to precomputed results, eliminating the need for arithmetic logic. LUTs not only reduce logic complexity but also naturally support diverse numerical precisions without requiring separate circuits for each bitwidth, an increasingly important feature in quantized DNNs. This creates a favorable tradeoff in PIM: memory capacity can be used in place of logic to increase computational throughput, aligning well with DRAM-PIM architectures that offer high bandwidth and abundant memory but limited logic density. In this work, we explore this capacity-computation tradeoff in LUT-based PIM designs, where memory capacity is traded for performance by packing multiple MAC operations into a single LUT lookup. Building on this insight, we propose LOCALUT, a PIM-based design for efficient low-bit quantized DNN inference using operation-packed LUTs. First, we observe that these LUTs contain extensive redundancy and introduce LUT canonicalization, which eliminates duplicate entries to reduce LUT size. Second, we propose the reordering LUT, a lightweight auxiliary LUT that remaps weight vectors to the canonical form required by LUT canonicalization with a single lookup. Third, we propose LUT slice streaming, a novel execution strategy that exploits the DRAM-buffer hierarchy by streaming only the relevant LUT columns into the buffer and reusing them across multiple weight vectors. Evaluated on a real system based on UPMEM devices, we demonstrate a geometric mean speedup of 1.82× across various numerical precisions and DNN models. We believe LOCALUT opens a path toward scalable, low-logic PIM designs tailored for LUT-based DNN inference. Our implementation of LOCALUT is available at https://github.com/AIS-SNU/LoCaLUT.
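The redundancy that LUT canonicalization removes can be made concrete with a small sketch. This is a hedged illustration under assumptions of our own (2-bit activations, two packed MACs, canonical form = sorted weight vector); the paper's actual canonical form and remapping scheme may differ. The point it demonstrates: the packed LUTs for weight vectors (w0, w1) and (w1, w0) hold the same values in permuted order, so one canonical LUT plus a tiny index-remapping table (the role played by the reordering LUT) can serve both.

```python
# Hedged sketch of LUT canonicalization + index reordering, assuming
# 2-bit activations and two packed MACs per lookup.

BITS = 2
LEVELS = 1 << BITS

def build_packed_lut(w):
    return [a0 * w[0] + a1 * w[1]
            for a0 in range(LEVELS)
            for a1 in range(LEVELS)]

def canonicalize(w):
    # Assumed canonical form: weights sorted ascending. The permutation
    # that sorts them tells us how to remap activation indices.
    order = sorted(range(len(w)), key=lambda i: w[i])
    return tuple(w[i] for i in order), order

def build_reorder_lut(order):
    # Auxiliary table: original packed index -> index into canonical LUT.
    remap = []
    for idx in range(LEVELS ** 2):
        a = [(idx >> BITS) & (LEVELS - 1), idx & (LEVELS - 1)]
        ca = [a[i] for i in order]
        remap.append((ca[0] << BITS) | ca[1])
    return remap

canon, order = canonicalize((5, -1))   # canonical weights: (-1, 5)
canon_lut = build_packed_lut(canon)    # stored once, shared by both orders
reorder = build_reorder_lut(order)
direct = build_packed_lut((5, -1))
assert all(direct[i] == canon_lut[reorder[i]] for i in range(len(direct)))
```

With more packed operations, many weight vectors collapse onto each canonical representative, which is where the LUT-size savings come from.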
Problem

Research questions and friction points this paper is trying to address.

LUT-based inference
DRAM-PIM
capacity-computation tradeoff
quantized DNNs
memory capacity
Innovation

Methods, ideas, or system contributions that make the work stand out.

LUT-based inference
DRAM-PIM
capacity-computation tradeoff
LUT canonicalization
quantized DNN
Junguk Hong
Department of Electrical and Computer Engineering, Seoul National University
Changmin Shin
Department of Electrical and Computer Engineering, Seoul National University
Sukjin Kim
Department of Electrical and Computer Engineering, Seoul National University
Si Ung Noh
Department of Electrical and Computer Engineering, Seoul National University
Taehee Kwon
Department of Electrical and Computer Engineering, Seoul National University
Seongyeon Park
Department of Electrical and Computer Engineering, Seoul National University
Hanjun Kim
Professor, Yonsei University
Computer Architecture, Compiler, Parallel Programming
Youngsok Kim
Yonsei University
computer architecture, system software, hardware acceleration
Jinho Lee
Department of Electrical and Computer Engineering, Seoul National University
Computer architecture, Computer systems, Machine learning