CD-PIM: A High-Bandwidth and Compute-Efficient LPDDR5-Based PIM for Low-Batch LLM Acceleration on Edge-Device

📅 2026-01-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bandwidth bottleneck that limits large language model (LLM) inference on edge devices under low-batch settings, where existing processing-in-memory (PIM) architectures suffer from insufficient bandwidth gains, low resource utilization, and inadequate compute capability. To overcome these challenges, the authors propose CD-PIM, a novel PIM architecture tailored for edge-based low-batch LLM acceleration, built upon LPDDR5. CD-PIM introduces a high-bandwidth compute-efficient mode (HBCEM), a low-batch interleaved execution mechanism (LBIM), pipelined compute units, and a hybrid row-column mapping strategy for key-value caches. Through pseudo-bank construction, overlapped scheduling of GEMV/GEMM operations, and serial weight input, the design significantly enhances bandwidth utilization and resource efficiency. Experiments show that CD-PIM achieves an average speedup of 11.42× over GPU baselines and 4.25× over state-of-the-art PIM approaches under single-batch inference, with LBIM further delivering a 1.12× performance gain in low-batch scenarios.
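The memory-bound nature of single-batch decoding that motivates CD-PIM can be seen from a back-of-envelope arithmetic-intensity calculation: at batch size 1, every weight byte fetched from DRAM supports only a handful of FLOPs, so a GEMV layer is limited by bandwidth rather than compute. The sketch below is illustrative only; the sizes and the FP16 assumption are not taken from the paper.

```python
def arithmetic_intensity(rows, cols, batch, bytes_per_weight=2):
    """FLOPs per byte of weight traffic for a (rows x cols) weight matrix
    multiplied against a (cols x batch) activation block.
    bytes_per_weight=2 assumes FP16 weights (an illustrative choice)."""
    flops = 2 * rows * cols * batch           # one multiply + one add per weight per batch element
    weight_bytes = rows * cols * bytes_per_weight
    return flops / weight_bytes

# batch=1 (GEMV): 1 FLOP per weight byte -> bandwidth-bound.
print(arithmetic_intensity(4096, 4096, batch=1))    # 1.0
# batch=16 (GEMM): weights are reused across the batch -> compute becomes viable.
print(arithmetic_intensity(4096, 4096, batch=16))   # 16.0
```

This reuse gap is why the summary distinguishes GEMV (low-batch, bandwidth-starved) from GEMM (batched, weight-reusing) and why LBIM overlaps the two to keep PIM compute units busy.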

📝 Abstract
Edge deployment of low-batch large language models (LLMs) faces critical memory bandwidth bottlenecks when executing memory-intensive general matrix-vector multiplication (GEMV) operations. While digital processing-in-memory (PIM) architectures promise to accelerate GEMV operations, existing PIM-equipped edge devices still suffer from three key limitations: limited bandwidth improvement, component under-utilization in mixed workloads, and low compute capacity of computing units (CUs). In this paper, we propose CD-PIM to address these challenges through four key innovations. First, we introduce a high-bandwidth compute-efficient mode (HBCEM) that enhances bandwidth by dividing each bank into four pseudo-banks through segmented global bitlines. Second, we propose a low-batch interleaving mode (LBIM) to improve component utilization by overlapping GEMV operations with GEMM operations. Third, we design a compute-efficient CU that performs enhanced GEMV operations in a pipelined manner by serially feeding weight data into the computing core. Fourth, we adopt a column-wise mapping for the key-cache matrix and row-wise mapping for the value-cache matrix, which fully utilizes CU resources. Our evaluation shows that compared to a GPU-only baseline and state-of-the-art PIM designs, our CD-PIM achieves 11.42x and 4.25x speedup on average within a single batch in HBCEM mode, respectively. Moreover, for low-batch sizes, the CD-PIM achieves an average speedup of 1.12x in LBIM compared to HBCEM.
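The hybrid KV-cache mapping in the abstract exploits the fact that the two attention matrix products access the caches in opposite orders: the score computation streams across keys, while the output computation accumulates over value rows. The toy sketch below illustrates the access-pattern asymmetry with a transposed (column-wise) key layout; the tensor shapes and NumPy layout are illustrative assumptions, not the paper's actual DRAM mapping.

```python
import numpy as np

# Toy single-query decode step (shapes are illustrative).
d_head, seq_len = 8, 16
rng = np.random.default_rng(0)
q = rng.standard_normal(d_head)              # current query vector
K = rng.standard_normal((seq_len, d_head))   # key cache, one row per past token
V = rng.standard_normal((seq_len, d_head))   # value cache, one row per past token

# Column-wise (transposed) key layout: scores s = q @ K^T become a
# contiguous sweep along the stored axis.
K_colwise = np.ascontiguousarray(K.T)        # shape (d_head, seq_len)
s = q @ K_colwise                            # one dot product per cached key

# Row-wise value layout: the output o = s @ V accumulates V row by row,
# which is already the natural storage order.
o = s @ V

# Same result as the textbook formulation, but each operand is now read
# sequentially in its stored layout.
assert np.allclose(o, (q @ K.T) @ V)
```

Matching each cache's layout to its traversal direction is what lets a serial-weight-input CU keep its pipeline fed for both attention products instead of idling on strided reads.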
Problem

Research questions and friction points this paper is trying to address.

memory bandwidth bottleneck
large language models
processing-in-memory
GEMV
edge-device
Innovation

Methods, ideas, or system contributions that make the work stand out.

Processing-in-Memory
LPDDR5
GEMV acceleration
Low-batch LLM
Compute-efficient architecture
Ye Lin
School of Electronic Science and Engineering, Nanjing University, China
Chao Fang
Shanghai Qi Zhi Institute
Research interests: efficient ML, AI accelerator, hardware-software co-design, precision-scalable computing, RISC-V
Xiaoyong Song
China Mobile Research Institute, China
Qi Wu
School of Electronic Science and Engineering, Nanjing University, China
Anying Jiang
School of Electronic Science and Engineering, Nanjing University, China
Yichuan Bai
School of Electronic Science and Engineering, Nanjing University, China
Li Du
School of Electronic Science and Engineering, Nanjing University, China