HPIM: Heterogeneous Processing-In-Memory-based Accelerator for Large Language Models Inference

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe memory bandwidth bottlenecks and low compute utilization in the autoregressive decoding stage of large language models (LLMs), this work proposes the first heterogeneous in-memory computing (IMC) architecture to integrate SRAM-based and HBM-based PIM. It offloads latency-sensitive attention computation to low-latency SRAM-PIM, delegates weight-intensive GEMV operations to high-bandwidth HBM-PIM, and introduces a tightly coupled pipelined scheduling mechanism to relax the sequential dependency of decoding. Through hardware-software co-optimization (a domain-specific compiler framework, dynamic workload partitioning, and cross-PIM pipelined scheduling), the design significantly enhances on-chip parallelism. Cycle-accurate simulation demonstrates a peak inference speedup of 22.8× over the NVIDIA A100 GPU, with superior throughput and scalability compared to state-of-the-art accelerators.

📝 Abstract
The deployment of large language models (LLMs) presents significant challenges due to their enormous memory footprints, low arithmetic intensity, and stringent latency requirements, particularly during the autoregressive decoding stage. Traditional compute-centric accelerators, such as GPUs, suffer from severe resource underutilization and memory bandwidth bottlenecks in these memory-bound workloads. To overcome these fundamental limitations, we propose HPIM, the first memory-centric heterogeneous Processing-In-Memory (PIM) accelerator that integrates SRAM-PIM and HBM-PIM subsystems designed specifically for LLM inference. HPIM employs a software-hardware co-design approach that combines a specialized compiler framework with a heterogeneous hardware architecture. It intelligently partitions workloads based on their characteristics: latency-critical attention operations are mapped to the SRAM-PIM subsystem to exploit its ultra-low latency and high computational flexibility, while weight-intensive GEMV computations are assigned to the HBM-PIM subsystem to leverage its high internal bandwidth and large storage capacity. Furthermore, HPIM introduces a tightly coupled pipeline strategy across the SRAM-PIM and HBM-PIM subsystems to maximize intra-token parallelism, thereby significantly mitigating the serial dependency of the autoregressive decoding stage. Comprehensive evaluations using a cycle-accurate simulator demonstrate that HPIM significantly outperforms state-of-the-art accelerators, achieving a peak speedup of up to 22.8x compared to the NVIDIA A100 GPU. Moreover, HPIM exhibits superior performance over contemporary PIM-based accelerators, highlighting its potential as a highly practical and scalable solution for accelerating large-scale LLM inference.
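The abstract's partitioning rule (latency-critical attention to SRAM-PIM, weight-intensive GEMV to HBM-PIM) can be sketched as a simple routing function. This is an illustrative sketch only; the `Op` type, `assign_subsystem` function, and operation names are hypothetical and not taken from the paper, whose compiler performs this mapping with far richer cost models.

```python
# Hypothetical sketch of HPIM-style workload partitioning. Names are
# illustrative: latency-critical attention ops are routed to SRAM-PIM,
# weight-heavy GEMV ops to HBM-PIM, and anything else falls back to the host.
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    kind: str  # "attention" or "gemv"


def assign_subsystem(op: Op) -> str:
    """Route an op to a PIM subsystem by its dominant characteristic."""
    if op.kind == "attention":
        return "SRAM-PIM"  # ultra-low latency, flexible compute
    if op.kind == "gemv":
        return "HBM-PIM"   # high internal bandwidth, large capacity
    return "host"          # fallback for unclassified work


layer = [Op("qk_softmax_v", "attention"), Op("ffn_up_proj", "gemv")]
print([(op.name, assign_subsystem(op)) for op in layer])
# [('qk_softmax_v', 'SRAM-PIM'), ('ffn_up_proj', 'HBM-PIM')]
```

In practice such a router would also weigh tensor sizes and current subsystem occupancy; the point here is only the characteristic-based split the abstract describes.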
Problem

Research questions and friction points this paper is trying to address.

Accelerating LLM inference with memory-centric heterogeneous PIM architecture
Overcoming memory bandwidth bottlenecks in autoregressive decoding workloads
Optimizing workload partitioning between SRAM-PIM and HBM-PIM subsystems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous PIM accelerator for LLM inference
SRAM-PIM and HBM-PIM subsystems integration
Software-hardware co-design with workload partitioning
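The payoff of the cross-PIM pipeline can be seen with a toy two-stage timing model: once attention runs on SRAM-PIM while GEMV runs on HBM-PIM, successive units of work (e.g., per-head or per-tile chunks) can overlap instead of executing back-to-back. The latency numbers and overlap granularity below are assumptions for illustration, not figures from the paper.

```python
# Toy timing model (illustrative numbers, not from the paper) showing why a
# tightly coupled cross-PIM pipeline helps when the two subsystems handle
# different stages of the computation.

def serial_latency(n_units: int, t_attn: float, t_gemv: float) -> float:
    """All units run both stages back-to-back on a single path."""
    return n_units * (t_attn + t_gemv)


def pipelined_latency(n_units: int, t_attn: float, t_gemv: float) -> float:
    """Classic two-stage pipeline: fill once, then advance at the slower
    stage's rate while the other subsystem works in parallel."""
    return t_attn + t_gemv + (n_units - 1) * max(t_attn, t_gemv)


print(serial_latency(8, 2, 3))     # 40
print(pipelined_latency(8, 2, 3))  # 26
```

With balanced stage times the pipeline approaches a 2x throughput gain; HPIM's scheduler additionally has to respect the data dependencies within each decoded token, which this sketch ignores.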
Cenlin Duan
School of Integrated Circuit Science and Engineering, Beihang University, Beijing, 100191, China, and State Key Laboratory of Spintronics, Hangzhou International Innovation Institute, Beihang University, Hangzhou 311115, China
Jianlei Yang
Beihang University
Deep Learning | Computer Architecture | Neuromorphic Computing | Spintronics | EDA/VLSI
Rubing Yang
University of Pennsylvania
Deep learning | Machine perception
Yikun Wang
Fudan University
Computer vision | Natural language processing
Yiou Wang
School of Computer Science and Engineering, Beihang University, Beijing, 100191, China, and Qingdao Research Institute, Beihang University, Qingdao 266104, China
Lingkun Long
School of Computer Science and Engineering, Beihang University, Beijing, 100191, China, and Qingdao Research Institute, Beihang University, Qingdao 266104, China
Yingjie Qi
School of Computer Science and Engineering, Beihang University, Beijing, 100191, China, and Qingdao Research Institute, Beihang University, Qingdao 266104, China
Xiaolin He
School of Computer Science and Engineering, Beihang University, Beijing, 100191, China, and Qingdao Research Institute, Beihang University, Qingdao 266104, China
Ao Zhou
School of Computer Science and Engineering, Beihang University, Beijing, 100191, China, and Qingdao Research Institute, Beihang University, Qingdao 266104, China
Xueyan Wang
School of Integrated Circuit Science and Engineering, Beihang University, Beijing, 100191, China, and State Key Laboratory of Spintronics, Hangzhou International Innovation Institute, Beihang University, Hangzhou 311115, China
Weisheng Zhao
Fert Beijing Institute, Beihang University
Spintronics Devices and Integrated Circuits