🤖 AI Summary
This work addresses the memory-bandwidth and storage-access bottlenecks that limit large language model (LLM) inference on resource-constrained edge devices. The authors propose NVLLM, a 3D NAND-centric heterogeneous architecture that offloads feed-forward network (FFN) computation into flash memory while handling attention on lightweight CMOS logic backed by external DRAM; wafer-to-wafer stacking provides the tight integration. By co-designing FFN execution with 3D NAND, the system reads FFN weights directly at page granularity, without passing them through DRAM, and combines a KV-cache-aware scheduler with an array of out-of-order processing elements (PEs). With only 2.7% CMOS area overhead, the design achieves a 16.7×–37.9× speedup over A800-based out-of-core inference and up to a 4.7× speedup over SSD-like designs on OPT and LLaMA models with up to 30 billion parameters.
📝 Abstract
The rapid growth of LLMs demands high-throughput, memory-capacity-intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out-of-core GPU-based systems and SSD-like accelerators are limited by DRAM-bound weight movement and inefficient storage access granularity. We present NVLLM, a 3D NAND-centric inference architecture that offloads feed-forward network (FFN) computation into flash while executing attention on lightweight CMOS logic with external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, error correction code (ECC) units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM/GEMV operations are decomposed into dot-product primitives executed by out-of-order PE lanes, which operate directly on raw NAND reads with integrated ECC. Attention weights remain in DRAM, and a KV-cache-aware scheduler sustains throughput as the context length grows. Evaluated on OPT and LLaMA models with up to 30B parameters, NVLLM achieves a 16.7×–37.9× speedup over A800-based out-of-core inference and up to a 4.7× speedup over SSD-like designs, with only 2.7% CMOS area overhead.
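The decomposition of GEMV into dot-product primitives that can complete in any order can be illustrated with a small software sketch. This is not the NVLLM hardware design: the page size `PAGE_ELEMS`, the helper names, and the use of shuffling to stand in for out-of-order page arrival are all assumptions made for illustration, and ECC is omitted entirely. Integer weights are used so that the accumulation order cannot change the result.

```python
import random

PAGE_ELEMS = 4  # hypothetical number of weights per page; real NAND pages are far larger


def split_into_pages(weights, rows, cols, page_elems=PAGE_ELEMS):
    """Cut a row-major rows x cols matrix into (row, col_start, chunk) pages."""
    pages = []
    for r in range(rows):
        for c0 in range(0, cols, page_elems):
            start = r * cols + c0
            stop = r * cols + min(c0 + page_elems, cols)
            pages.append((r, c0, weights[start:stop]))
    return pages


def page_dot(chunk, x, c0):
    """Dot-product primitive: one page of weights against the matching slice of x."""
    return sum(w * x[c0 + i] for i, w in enumerate(chunk))


def gemv_from_pages(pages, x, rows):
    """Accumulate per-page partial sums into y, in whatever order pages arrive."""
    y = [0] * rows
    for r, c0, chunk in pages:
        y[r] += page_dot(chunk, x, c0)
    return y


def gemv_reference(weights, x, rows, cols):
    """Plain row-by-row GEMV used to check the page-wise result."""
    return [sum(weights[r * cols + c] * x[c] for c in range(cols))
            for r in range(rows)]


rng = random.Random(0)
ROWS, COLS = 3, 10
W = [rng.randint(-3, 3) for _ in range(ROWS * COLS)]
x = [rng.randint(-3, 3) for _ in range(COLS)]

pages = split_into_pages(W, ROWS, COLS)
rng.shuffle(pages)  # stand-in for pages completing out of order
y = gemv_from_pages(pages, x, ROWS)
assert y == gemv_reference(W, x, ROWS, COLS)
```

Because each page contributes an independent partial sum to exactly one output element, no page has to wait for another, which is the property an out-of-order lane array would rely on.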