🤖 AI Summary
DRAM-based Processing-in-Memory (PIM) architectures face three critical bottlenecks in large language model (LLM) inference: poor data reuse, excessive redundant memory accesses, and inadequate hardware-software mapping support. To address these, this work proposes RACAM, the first in-DRAM computing architecture that supports both bit-serial computation and fine-grained data reuse. It integrates multi-level locality buffers, a global broadcast network, and a configurable bit-serial processing element (PE) array to co-optimize data locality and DRAM's inherent parallelism at the hardware level. On top of this, the authors design a constraint-solving-based automated mapping compiler that jointly optimizes inter-layer LLM dataflows and DRAM's physical topology, the first such approach. Evaluation shows speedups of 9×–102× over GPUs and a 233× improvement in area-normalized performance over the state-of-the-art DRAM-PIM system Proteus for GPT-3 inference.
📝 Abstract
In-DRAM Processing-In-Memory (DRAM-PIM) has emerged as a promising approach to accelerating memory-intensive workloads by mitigating data-transfer overhead between DRAM and the host processor. Bit-serial DRAM-PIM architectures further improve efficiency by supporting runtime-variable data precision, which is critical for emerging workloads such as large language model (LLM) inference. However, existing designs have major limitations: a lack of data reuse, large amounts of redundant data transfer, and insufficient support for workload mapping. To address these issues, we propose RACAM, the first in-DRAM bit-serial architecture that uses dedicated locality buffers, bit-serial PEs, popcount reduction units, and broadcast units to enable data reuse and reduce redundant data transfers. Furthermore, we propose a workload mapping mechanism that fully exploits the massive parallelism of the DRAM architecture and identifies the best mapping scheme for a given workload. We evaluate RACAM against GPUs and the state-of-the-art in-DRAM PIM system Proteus on end-to-end LLM inference. RACAM achieves a 9×–102× speedup over GPUs and 233× higher performance per mm² than Proteus for GPT-3.
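To make the bit-serial PE plus popcount-reduction idea concrete, here is a minimal Python sketch of the underlying arithmetic primitive (this is an illustrative model, not RACAM's actual datapath): operand vectors are decomposed into bit planes, each pair of planes is combined with a bitwise AND, and a popcount of the result is accumulated with the appropriate power-of-two weight. This is how bit-serial PIM designs typically realize multi-bit dot products from single-bit row operations, and it shows why precision is a runtime parameter (the `bits` argument) rather than a fixed hardware width.

```python
def to_bit_planes(values, bits):
    """Pack bit position b of every element into one integer bitmask,
    modeling one 'bit plane' stored across a DRAM row."""
    planes = []
    for b in range(bits):
        mask = 0
        for i, v in enumerate(values):
            mask |= ((v >> b) & 1) << i
        planes.append(mask)
    return planes

def bit_serial_dot(a, w, bits=4):
    """Dot product of two unsigned-int vectors, one bit plane at a time:
    sum over (i, j) of 2^(i+j) * popcount(a_plane[i] AND w_plane[j])."""
    a_planes = to_bit_planes(a, bits)
    w_planes = to_bit_planes(w, bits)
    acc = 0
    for i, ap in enumerate(a_planes):
        for j, wp in enumerate(w_planes):
            # Bitwise AND models an in-DRAM row operation; the popcount
            # models the popcount reduction unit.
            acc += bin(ap & wp).count("1") << (i + j)
    return acc

a = [3, 1, 2, 7]
w = [2, 5, 4, 1]
print(bit_serial_dot(a, w))  # 3*2 + 1*5 + 2*4 + 7*1 = 26
```

Note that the cost grows with `bits * bits` plane pairs, which is exactly why data reuse of the bit planes (the role of RACAM's locality buffers and broadcast units) matters for efficiency.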