Dissecting and Re-architecting 3D NAND Flash PIM Arrays for Efficient Single-Batch Token Generation in LLMs

📅 2025-11-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address DRAM capacity limitations and high GPU costs in single-batch token generation for large language models (LLMs), this work proposes a processing-in-memory (PIM) architecture leveraging 3D NAND flash. The design reconfigures NAND storage arrays, incorporates an H-tree interconnect topology, and introduces operation tiling and layer mapping strategies to enable high-density, low-latency in-memory inference within a compact die area (4.98 mm²). Compared to a four-GPU RTX 4090 system, the architecture achieves a 2.4× throughput improvement; its performance approaches that of a four-GPU A100 setup, incurring only a 4.9% latency overhead. This is the first systematic application of 3D NAND-based PIM to single-batch LLM inference, effectively overcoming the traditional memory wall. The work establishes a novel paradigm for cost-efficient, high-density AI accelerators by co-designing storage and computation at the hardware level.

📝 Abstract
The advancement of large language models has led to models with billions of parameters, significantly increasing memory and compute demands. Serving such models on conventional hardware is challenging due to limited DRAM capacity and high GPU costs. Thus, in this work, we propose offloading single-batch token generation to a 3D NAND flash processing-in-memory (PIM) device, leveraging its high storage density to overcome the DRAM capacity wall. We explore 3D NAND flash configurations and present a re-architected PIM array with an H-tree network for optimal latency and cell density. Along with the well-chosen PIM array size, we develop operation tiling and mapping methods for LLM layers, achieving a 2.4× speedup over four RTX 4090 GPUs running vLLM and comparable performance to four A100 GPUs with only 4.9% latency overhead. Our detailed area analysis reveals that the proposed 3D NAND flash PIM architecture can be integrated within a 4.98 mm² die area under the memory array, without extra area overhead.
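The core of single-batch token generation is a series of matrix-vector products, which is what the paper's operation tiling partitions across fixed-size PIM arrays. The paper does not publish code, so the sketch below is purely illustrative: the array dimensions (`ARRAY_ROWS`, `ARRAY_COLS`) are placeholder values, not the configuration chosen in the paper, and NumPy stands in for the in-array analog computation.

```python
import numpy as np

# Placeholder PIM array size -- NOT the paper's chosen configuration.
ARRAY_ROWS, ARRAY_COLS = 1024, 1024

def tiled_gemv(weight, x):
    """Tile a weight matrix over fixed-size PIM arrays and accumulate
    each array's partial matrix-vector product into the output vector."""
    out_dim, in_dim = weight.shape
    y = np.zeros(out_dim)
    for r in range(0, out_dim, ARRAY_ROWS):
        for c in range(0, in_dim, ARRAY_COLS):
            tile = weight[r:r+ARRAY_ROWS, c:c+ARRAY_COLS]   # one PIM array
            y[r:r+ARRAY_ROWS] += tile @ x[c:c+ARRAY_COLS]   # in-array GEMV
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 3072))
x = rng.standard_normal(3072)
assert np.allclose(tiled_gemv(W, x), W @ x)
```

Each inner-loop iteration corresponds to one array computing a partial sum on its resident weight tile; the partial sums along the column dimension are then reduced, which is where the interconnect topology matters.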
Problem

Research questions and friction points this paper is trying to address.

Overcoming DRAM capacity limitations for large language models
Reducing GPU costs through 3D NAND flash processing-in-memory
Optimizing latency and density for single-batch token generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Offloading token generation to 3D NAND flash PIM
Re-architected PIM array with H-tree network
Operation tiling and mapping methods for LLMs
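The latency benefit of an H-tree interconnect comes from merging partial sums in a balanced tree rather than draining arrays sequentially over a shared bus. The toy model below is an assumption-laden illustration of that scaling argument only; the array counts and the `reduction_hops` function are hypothetical and do not reproduce the paper's actual wire delays or topology parameters.

```python
import math

def reduction_hops(n_leaves, topology):
    """Sequential steps to gather partial sums from n_leaves PIM arrays
    (hypothetical model: unit cost per hop, no wire-length effects)."""
    if topology == "shared-bus":
        return n_leaves                          # arrays drain one at a time
    if topology == "h-tree":
        return math.ceil(math.log2(n_leaves))    # balanced pairwise merge
    raise ValueError(topology)

# Gathering from 64 arrays: 64 sequential steps on a bus vs. 6 tree levels.
assert reduction_hops(64, "shared-bus") == 64
assert reduction_hops(64, "h-tree") == 6
```

Under this simplified model the reduction cost drops from O(n) to O(log n) in the number of arrays, which is the qualitative reason a tree-structured network helps latency at high array counts.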
Yongjoo Jang
Korea University, Seoul, South Korea
Sangwoo Hwang
Korea University, Seoul, South Korea
Hojin Lee
Korea University, Seoul, South Korea
Sangwoo Jung
Korea University, Seoul, South Korea
Donghun Lee
Korea University, Seoul, South Korea
Wonbo Shim
Seoul National University of Science and Technology, Seoul, South Korea
Jaeha Kung
Associate Professor, Korea University
Accelerator Design · Approximate Computing · ML Architecture · VLSI