🤖 AI Summary
To address DRAM capacity limitations and high GPU costs in single-batch token generation for large language models (LLMs), this work proposes a processing-in-memory (PIM) architecture leveraging 3D NAND flash. The design reconfigures NAND storage arrays, incorporates an H-tree interconnect topology, and introduces operation tiling and layer mapping strategies to enable high-density, low-latency in-memory inference within a compact die area (4.98 mm²). Compared to a four-GPU RTX 4090 system, the architecture achieves a 2.4× throughput improvement; its performance approaches that of a four-GPU A100 setup, incurring only a 4.9% latency overhead. This is the first systematic application of 3D NAND-based PIM to single-batch LLM inference, effectively overcoming the traditional memory wall. The work establishes a novel paradigm for cost-efficient, high-density AI accelerators by co-designing storage and computation at the hardware level.
📝 Abstract
The advancement of large language models has led to models with billions of parameters, significantly increasing memory and compute demands. Serving such models on conventional hardware is challenging due to limited DRAM capacity and high GPU costs. In this work, we therefore propose offloading single-batch token generation to a 3D NAND flash processing-in-memory (PIM) device, leveraging its high storage density to overcome the DRAM capacity wall. We explore 3D NAND flash configurations and present a re-architected PIM array with an H-tree network for optimal latency and cell density. Together with a well-chosen PIM array size, we develop operation tiling and mapping methods for LLM layers, achieving a 2.4× speedup over four RTX 4090 GPUs running vLLM and performance comparable to four A100 GPUs with only 4.9% latency overhead. Our detailed area analysis reveals that the proposed 3D NAND flash PIM architecture can be integrated within a 4.98 mm² die area under the memory array, without extra area overhead.
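To make the operation-tiling idea concrete, here is a minimal sketch of how a layer's matrix-vector product might be partitioned across fixed-size PIM arrays, with partial sums reduced across column tiles. The tile dimensions and the functional model are illustrative assumptions, not the paper's actual array size or circuit behavior.

```python
import numpy as np

# Hypothetical PIM array tile size (illustrative only; the chosen
# array dimensions in the actual design are not given here).
TILE_ROWS, TILE_COLS = 128, 128

def pim_tiled_matvec(W, x):
    """Emulate tiling a layer's weight matrix W across PIM arrays:
    each TILE_ROWS x TILE_COLS tile performs a partial in-memory
    matvec, and partial results are accumulated across column tiles."""
    out_dim, in_dim = W.shape
    y = np.zeros(out_dim)
    for r in range(0, out_dim, TILE_ROWS):
        for c in range(0, in_dim, TILE_COLS):
            tile = W[r:r + TILE_ROWS, c:c + TILE_COLS]  # one PIM array
            y[r:r + TILE_ROWS] += tile @ x[c:c + TILE_COLS]  # partial sum
    return y

# Functional check against a plain matvec.
W = np.random.randn(512, 384)
x = np.random.randn(384)
assert np.allclose(pim_tiled_matvec(W, x), W @ x)
```

In a real design, each tile's partial product would be computed inside a NAND array and the accumulation would traverse the interconnect (here, the H-tree), so tile size trades off array utilization against reduction latency.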