AME-PIM: Can Memory be Your Next Tensor Accelerator?

📅 2026-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing HBM-PIM platforms struggle to support general-purpose tensor acceleration because they expose restrictive instruction sets and handle general matrix operations inefficiently. To address this, this work proposes a PEP-based execution model grounded in RISC-V AME semantics, which maps AME element-wise and matrix instructions onto HBM-PIM micro-kernels. It introduces a reduction-free outer-product dataflow that accumulates entirely in memory, even though the HBM-PIM hardware lacks native reduction support. The design supports end-to-end PIM execution of GEMM, GEMV, and element-wise operations. Evaluated on Samsung Aquabolt-XL, AME matrix tile multiplication achieves up to 14.9 GFLOP/s (59.4 FLOP/cycle) on a single HBM pseudo-channel.
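To make the "reduction-free" idea concrete, here is a minimal sketch in plain Python of the generic outer-product formulation of matrix multiplication, set against the more familiar inner-product formulation. It is an illustration of the general technique only, not the paper's micro-kernel: the function names are hypothetical, and treating each accumulator element as an independent SIMD lane is an assumption made for the example.

```python
def matmul_outer_product(a, b):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates.

    For each k, column k of A and row k of B produce a full m x n update.
    Every accumulator C[i][j] receives exactly one multiply-accumulate per
    k step, so no sum across accumulator elements is ever required.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for k in range(k_dim):
        col = [a[i][k] for i in range(m)]  # column k of A
        row = b[k]                         # row k of B
        for i in range(m):
            for j in range(n):
                # Assumption: each (i, j) accumulator behaves like its own lane.
                c[i][j] += col[i] * row[j]
    return c


def matmul_inner_product(a, b):
    """Reference formulation: each C[i][j] is a dot product over k.

    If the k dimension were spread across lanes, the sum() below would be
    a cross-lane reduction, which is the step the outer-product form avoids.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(k_dim)) for j in range(n)]
        for i in range(m)
    ]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C_outer = matmul_outer_product(A, B)
C_inner = matmul_inner_product(A, B)
```

Both functions return the same matrix; they differ only in loop order and in where the accumulation happens, which is what lets the outer-product form keep all partial sums in place.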
📝 Abstract
High Bandwidth Memory with Processing-in-Memory (HBM-PIM) offers an opportunity to reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited instruction sets and require specialized software stacks. In this work, we investigate whether HBM-PIM can serve as a backend for ISA-level matrix acceleration, using the RISC-V Attached Matrix Extension (AME) as a semantic reference. We propose a PEP-based execution model that maps AME element-wise and matrix instructions to HBM-PIM micro-kernels and data instructions to memory operations. Unlike state-of-the-art (SoA) HBM-PIM approaches, we introduce a reduction-free outer-product dataflow that enables accumulation entirely within memory despite the lack of native reduction support. Our approach supports end-to-end execution of element-wise operations, GEMV, and GEMM in PIM mode, minimizing host involvement and off-chip transfers. An experimental evaluation on Samsung Aquabolt-XL shows that AME matrix tile multiplication achieves up to 14.9 GFLOP/s (59.4 FLOP/cycle) on a single HBM pseudo-channel.
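The two reported throughput figures can be cross-checked with a little arithmetic. The sketch below derives the cycle rate they imply, assuming both numbers describe the same measurement; the resulting frequency is an inference from the reported values, not a figure stated in the abstract, and the variable names are illustrative.

```python
# Reported peak figures for AME matrix tile multiplication
# on a single HBM pseudo-channel.
gflop_per_second = 14.9
flop_per_cycle = 59.4

# FLOP/s divided by FLOP/cycle gives cycles/s. This is an implied value,
# assuming both figures come from the same run and are rounded.
implied_cycles_per_second = gflop_per_second * 1e9 / flop_per_cycle
implied_clock_mhz = implied_cycles_per_second / 1e6

print(f"Implied cycle rate: {implied_clock_mhz:.1f} MHz")
```

Under that assumption the figures are consistent with a cycle rate of roughly 250 MHz.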
Problem

Research questions and friction points this paper is trying to address.

Processing-in-Memory
Matrix Acceleration
HBM-PIM
Instruction Set Architecture
Data Movement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Processing-in-Memory
HBM-PIM
RISC-V AME
reduction-free dataflow
matrix acceleration