🤖 AI Summary
This work addresses the memory bandwidth bottleneck that limits large language model (LLM) inference on edge devices under low-batch settings, where existing processing-in-memory (PIM) architectures suffer from insufficient bandwidth gains, low resource utilization, and inadequate compute capability. To overcome these challenges, the authors propose CD-PIM, a novel PIM architecture built on LPDDR5 and tailored for low-batch LLM acceleration at the edge. CD-PIM introduces a high-bandwidth compute-efficient mode (HBCEM), a low-batch interleaving mode (LBIM), pipelined compute units, and a hybrid row-column mapping strategy for key-value caches. Through pseudo-bank construction, overlapped scheduling of GEMV/GEMM operations, and serial weight input, the design significantly improves bandwidth utilization and resource efficiency. Experiments show that CD-PIM achieves an average speedup of 11.42× over a GPU baseline and 4.25× over state-of-the-art PIM approaches under single-batch inference, with LBIM delivering a further 1.12× performance gain in low-batch scenarios.
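The benefit of overlapping memory-bound GEMV with compute-bound GEMM can be seen with a back-of-the-envelope timing model. The sketch below is purely illustrative (the cycle counts and two-resource model are assumptions, not figures from the paper): when the two operation types are interleaved, the memory channel and the compute units stay busy simultaneously, so each round costs only the busier resource's time rather than the sum of both phases.

```python
# Hedged toy model of LBIM-style interleaving (all numbers illustrative):
# GEMV is bandwidth-bound, GEMM is compute-bound, and the device has one
# memory channel plus one set of compute units that can work in parallel.
gemv_mem, gemv_cmp = 10, 2   # cycles a GEMV spends on memory / compute
gemm_mem, gemm_cmp = 2, 10   # cycles a GEMM spends on memory / compute
n_rounds = 8                 # interleaved GEMV+GEMM rounds

# Serial execution: every phase of every op runs back to back.
serial = n_rounds * (gemv_mem + gemv_cmp + gemm_mem + gemm_cmp)

# Interleaved execution: while one op streams from memory, the other
# occupies the compute units, so a round costs the busier resource only.
mem_busy = gemv_mem + gemm_mem   # memory-channel occupancy per round
cmp_busy = gemv_cmp + gemm_cmp   # compute-unit occupancy per round
interleaved = n_rounds * max(mem_busy, cmp_busy)

print(f"serial={serial} cycles, interleaved={interleaved} cycles, "
      f"speedup={serial / interleaved:.2f}x")
```

With these (assumed) symmetric costs the steady-state speedup is 2×; the paper's measured 1.12× gain for LBIM over HBCEM reflects real workloads where the two phases are far less perfectly complementary.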
📝 Abstract
Edge deployment of low-batch large language models (LLMs) faces critical memory bandwidth bottlenecks when executing memory-intensive general matrix-vector multiplication (GEMV) operations. While digital processing-in-memory (PIM) architectures promise to accelerate GEMV operations, existing PIM-equipped edge devices still suffer from three key limitations: limited bandwidth improvement, component under-utilization in mixed workloads, and low compute capacity of computing units (CUs). In this paper, we propose CD-PIM to address these challenges through four key innovations. First, we introduce a high-bandwidth compute-efficient mode (HBCEM) that enhances bandwidth by dividing each bank into four pseudo-banks through segmented global bitlines. Second, we propose a low-batch interleaving mode (LBIM) that improves component utilization by overlapping GEMV operations with GEMM operations. Third, we design a compute-efficient CU that performs enhanced GEMV operations in a pipelined manner by serially feeding weight data into the computing core. Fourth, we adopt a column-wise mapping for the key-cache matrix and a row-wise mapping for the value-cache matrix, which fully utilizes CU resources. Our evaluation shows that, compared to a GPU-only baseline and state-of-the-art PIM designs, CD-PIM achieves 11.42x and 4.25x speedup on average for single-batch inference in HBCEM, respectively. Moreover, at low batch sizes, CD-PIM achieves an average speedup of 1.12x in LBIM compared to HBCEM.
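The intuition behind the hybrid key/value-cache mapping can be sketched numerically. The toy below is not CD-PIM's actual DRAM layout (the partitioning scheme, unit count, and shapes are assumptions, and softmax is omitted): it only shows why orienting the key cache for the score GEMV and the value cache for the value GEMV lets each hypothetical per-bank CU complete full dot products locally, with a cheap reduction at the end.

```python
import numpy as np

# Hedged toy of hybrid KV-cache mapping (illustrative, not the paper's
# layout): partition the cached keys/values across n_cu per-bank CUs so
# both attention GEMVs stay local to a unit. Softmax is omitted.
rng = np.random.default_rng(0)
d, s, n_cu = 8, 16, 4          # head dim, sequence length, CU count
q = rng.random(d)              # query vector for one decode step
K = rng.random((s, d))         # key cache
V = rng.random((s, d))         # value cache

ref = (q @ K.T) @ V            # reference: score GEMV, then value GEMV

out = np.zeros(d)
for cu in range(n_cu):         # each CU owns a slice of keys/values
    sl = slice(cu * s // n_cu, (cu + 1) * s // n_cu)
    scores = K[sl] @ q         # local score GEMV on this CU's keys
    out += scores @ V[sl]      # local partial of the value GEMV

assert np.allclose(out, ref)   # partial sums reduce to the full result
```

Because (q @ K.T) @ V decomposes as a sum of per-key outer contributions, each unit's partial result depends only on the keys and values it stores, which is the property the column-wise/row-wise mapping exploits to keep every CU busy.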