🤖 AI Summary
To address DRAM capacity limitations and high GPU costs in single-batch token generation for large language models (LLMs), this work proposes a processing-in-memory (PIM) architecture leveraging 3D NAND flash. The design reconfigures NAND storage arrays, incorporates an H-tree interconnect topology, and introduces operation tiling and layer mapping strategies to enable high-density, low-latency in-memory inference within a compact die area (4.98 mm²). Compared to a four-GPU RTX 4090 system, the architecture achieves a 2.4× throughput improvement; its performance approaches that of a four-GPU A100 setup, incurring only a 4.9% latency overhead. This is the first systematic application of 3D NAND-based PIM to single-batch LLM inference, effectively overcoming the traditional memory wall. The work establishes a novel paradigm for cost-efficient, high-density AI accelerators by co-designing storage and computation at the hardware level.
📝 Abstract
The advancement of large language models has led to models with billions of parameters, significantly increasing memory and compute demands. Serving such models on conventional hardware is challenging due to limited DRAM capacity and high GPU costs. In this work, we therefore propose offloading single-batch token generation to a 3D NAND flash processing-in-memory (PIM) device, leveraging its high storage density to overcome the DRAM capacity wall. We explore 3D NAND flash configurations and present a re-architected PIM array with an H-tree network for optimal latency and cell density. Together with a well-chosen PIM array size, we develop operation tiling and mapping methods for LLM layers, achieving a 2.4× speedup over four RTX 4090 GPUs running vLLM and performance comparable to four A100 GPUs with only 4.9% latency overhead. Our detailed area analysis reveals that the proposed 3D NAND flash PIM architecture can be integrated within a 4.98 mm² die area under the memory array, without extra area overhead.
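To make the operation-tiling idea concrete, here is a minimal sketch of how a layer's matrix-vector product might be partitioned across fixed-size PIM arrays, with partial sums reduced across column tiles. The tile dimensions and the functional model are illustrative assumptions, not the paper's actual array size or circuit behavior.

```python
import numpy as np

# Hypothetical PIM array tile size (illustrative only; the chosen
# array dimensions in the actual design are not given here).
TILE_ROWS, TILE_COLS = 128, 128

def pim_tiled_matvec(W, x):
    """Emulate tiling a layer's weight matrix W across PIM arrays:
    each TILE_ROWS x TILE_COLS tile performs a partial in-memory
    matvec, and partial results are accumulated across column tiles."""
    out_dim, in_dim = W.shape
    y = np.zeros(out_dim)
    for r in range(0, out_dim, TILE_ROWS):
        for c in range(0, in_dim, TILE_COLS):
            tile = W[r:r + TILE_ROWS, c:c + TILE_COLS]  # one PIM array
            y[r:r + TILE_ROWS] += tile @ x[c:c + TILE_COLS]  # partial sum
    return y

# Functional check against a plain matvec.
W = np.random.randn(512, 384)
x = np.random.randn(384)
assert np.allclose(pim_tiled_matvec(W, x), W @ x)
```

In a real design, each tile's partial product would be computed inside a NAND array and the accumulation would traverse the interconnect (here, the H-tree), so tile size trades off array utilization against reduction latency.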