L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference

πŸ“… 2025-04-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the dual memory capacity and bandwidth bottlenecks imposed by KV caching in long-context LLM inference, this work proposes a heterogeneous in-memory computing (PIM) architecture co-integrating GPU and DIMM-level PIM. Focusing on multi-head attention computation during decoding, it introduces a hardware-level redesign to resolve the mismatch between PIM data layout and computational unitsβ€”a first in the literature. The design enables efficient heterogeneous parallelism via three key techniques: communication-computation overlap, hierarchical KV cache offloading across memory tiers, and adaptive cross-device scheduling. Evaluated on real-world LLM inference traces, the architecture achieves up to 6.1Γ— speedup over state-of-the-art HBM-based PIM approaches, while significantly increasing feasible batch sizes. Results demonstrate both scalability and practicality for production-grade long-context inference.

Technology Category

Application Category

πŸ“ Abstract
Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. While HBM-based acceleration offers high bandwidth, its capacity remains constrained. Offloading data to host-side DIMMs improves capacity but introduces costly data swapping overhead. We identify that the critical memory bottleneck lies in the decoding phase of multi-head attention (MHA) exclusively, which demands substantial capacity for storing KV caches and high bandwidth for attention computation. Our key insight reveals this operation uniquely aligns with modern DIMM-based processing-in-memory (PIM) architectures, which offers scalability of both capacity and bandwidth. Based on this observation and insight, we propose L3, a hardware-software co-designed system integrating DIMM-PIM and GPU devices. L3 introduces three innovations: First, hardware redesigns resolve data layout mismatches and computational element mismatches in DIMM-PIM, enhancing LLM inference utilization. Second, communication optimization enables hiding the data transfer overhead with the computation. Third, an adaptive scheduler coordinates GPU-DIMM-PIM operations to maximize parallelism between devices. Evaluations using real-world traces show L3 achieves up to 6.1$ imes$ speedup over state-of-the-art HBM-PIM solutions while significantly improving batch sizes.
Problem

Research questions and friction points this paper is trying to address.

Overcoming GPU memory limits for long-context LLM inference
Reducing data swapping overhead in host-side DIMM offloading
Optimizing multi-head attention decoding phase bottlenecks
Innovation

Methods, ideas, or system contributions that make the work stand out.

DIMM-PIM and GPU hardware-software co-design
Hardware redesigns for DIMM-PIM compatibility
Adaptive scheduler maximizes device parallelism
Q
Qingyuan Liu
Shanghai Jiao Tong University
Liyan Chen
Liyan Chen
Ph.D. Candidate, Department of Computer Science, Stevens Institute of Technology
Machine LearningComputer Vision
Y
Yanning Yang
Shanghai Jiao Tong University
H
Haocheng Wang
Shanghai Jiao Tong University
Dong Du
Dong Du
Associate Professor, Nanjing University of Science and Technology
Computer Graphics3D Computer Vision
Z
Zhigang Mao
Shanghai Jiao Tong University
N
Naifeng Jing
Shanghai Jiao Tong University
Yubin Xia
Yubin Xia
Professor, Shanghai Jiao Tong University
Operation SystemVirtualizationComputer ArchitectureSystem Security
H
Haibo Chen
Shanghai Jiao Tong University