🤖 AI Summary
DRAM-based Processing-in-Memory (PIM) architectures face three critical bottlenecks in large language model (LLM) inference: poor data reuse, excessive redundant memory accesses, and inadequate hardware-software mapping support. To address these, this work proposes RACAM, the first in-DRAM computing architecture that supports both bit-serial computation and fine-grained data reuse. It integrates multi-level locality buffers, a global broadcast network, and a configurable bit-serial processing element (PE) array to co-optimize data locality and DRAM's inherent parallelism at the hardware level. On top of this, the authors design a constraint-solving-based automated mapping compiler that jointly optimizes inter-layer LLM dataflows and DRAM's physical topology, the first such approach. Evaluation shows speedups of 9×–102× over GPUs and a 233× improvement in area-normalized performance over the state-of-the-art DRAM-PIM system Proteus for GPT-3 inference.
📝 Abstract
In-DRAM Processing-In-Memory (DRAM-PIM) has emerged as a promising approach to accelerating memory-intensive workloads by mitigating data-transfer overhead between DRAM and the host processor. Bit-serial DRAM-PIM architectures further improve efficiency by supporting runtime-variable data precision, which is critical for emerging workloads such as large language model (LLM) inference. However, existing designs have major limitations: a lack of data reuse, large amounts of redundant data transfer, and insufficient support for workload mapping. To address these issues, we propose RACAM, the first in-DRAM bit-serial architecture that uses dedicated locality buffers, bit-serial PEs, popcount reduction units, and broadcast units to enable data reuse and reduce redundant data transfers. Furthermore, we propose a workload mapping mechanism that fully exploits the massive parallelism of the DRAM architecture and identifies the best mapping scheme for a given workload. We evaluate RACAM against GPUs and the state-of-the-art in-DRAM PIM system Proteus on end-to-end LLM inference. RACAM achieves a 9×–102× speedup over GPUs and 233× higher performance per mm² than Proteus for GPT-3.
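To make the bit-serial PE plus popcount-reduction idea concrete, here is a minimal Python sketch of the underlying arithmetic primitive (this is an illustrative model, not RACAM's actual datapath): operand vectors are decomposed into bit planes, each pair of planes is combined with a bitwise AND, and a popcount of the result is accumulated with the appropriate power-of-two weight. This is how bit-serial PIM designs typically realize multi-bit dot products from single-bit row operations, and it shows why precision is a runtime parameter (the `bits` argument) rather than a fixed hardware width.

```python
def to_bit_planes(values, bits):
    """Pack bit position b of every element into one integer bitmask,
    modeling one 'bit plane' stored across a DRAM row."""
    planes = []
    for b in range(bits):
        mask = 0
        for i, v in enumerate(values):
            mask |= ((v >> b) & 1) << i
        planes.append(mask)
    return planes

def bit_serial_dot(a, w, bits=4):
    """Dot product of two unsigned-int vectors, one bit plane at a time:
    sum over (i, j) of 2^(i+j) * popcount(a_plane[i] AND w_plane[j])."""
    a_planes = to_bit_planes(a, bits)
    w_planes = to_bit_planes(w, bits)
    acc = 0
    for i, ap in enumerate(a_planes):
        for j, wp in enumerate(w_planes):
            # Bitwise AND models an in-DRAM row operation; the popcount
            # models the popcount reduction unit.
            acc += bin(ap & wp).count("1") << (i + j)
    return acc

a = [3, 1, 2, 7]
w = [2, 5, 4, 1]
print(bit_serial_dot(a, w))  # 3*2 + 1*5 + 2*4 + 7*1 = 26
```

Note that the cost grows with `bits * bits` plane pairs, which is exactly why data reuse of the bit planes (the role of RACAM's locality buffers and broadcast units) matters for efficiency.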