🤖 AI Summary
This work addresses the memory bottleneck and the inefficiency of 3D NAND-based in-memory computing in supporting dynamic sparse expert activation when deploying Mixture-of-Experts (MoE) large language models on edge devices. To overcome these challenges, the authors propose NASiC, a novel architecture that uniquely integrates content-addressable memory (CAM) with multi-bit 3D NAND in-memory computing. NASiC leverages a CAM-based gating mechanism to enable dynamic expert selection and activation computation within a single cycle, while employing block-level parallelism and in-situ signed multi-bit expansion to significantly enhance parallelism and memory utilization. Experimental results demonstrate that, compared to existing approaches, NASiC achieves 4–114.8× higher performance and 3.9–70× better energy efficiency while maintaining high inference accuracy, offering a promising pathway for efficient edge deployment of MoE models.
📝 Abstract
Mixture-of-Experts (MoE) models have emerged as the state-of-the-art paradigm for scaling up large language models (LLMs) without a proportional increase in computational cost. However, their on-device deployment faces a critical challenge: the large memory required to store all expert parameters. 3D NAND-based computing-in-memory (CIM) architectures offer high storage capacity and reduced data movement, but they are ill-suited to MoE models with dynamically sparse expert activation, which degrades effective computational parallelism and underutilizes the multibit storage capability of Flash cells. In this work, we propose a 3D NAND-based content-addressable-selected CIM architecture, dubbed NASiC, that is tailored to MoE models. By leveraging the intrinsic string structure of 3D NAND technology, NASiC fuses dynamic expert selection, performed through a content-addressable memory (CAM)-based masking mechanism, and activated-expert computation, performed through CIM, into a single computation cycle, eliminating redundant computation and enhancing computational parallelism. Moreover, circuit-level optimizations and a multibit CIM cell are co-designed with the NASiC architecture, featuring block-wise parallel computation with in-situ signed multibit input and weight expansion. This substantially improves the throughput and energy efficiency of the NAND CIM array, as well as the utilization of high-density 3D NAND technology for MoE models. Extensive experimental results demonstrate that NASiC achieves 4–114.8× higher performance and 3.9–70× better energy efficiency than state-of-the-art designs while maintaining high accuracy, showing its potential for efficient on-device MoE LLM inference.
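To make the idea of mask-selected expert computation concrete, the following is a minimal software sketch. It is not the NASiC circuit or the authors' implementation. It only illustrates, under stated assumptions, why a selection mask removes redundant work: the masked path evaluates only the top-k experts, while a dense reference evaluates every expert and zeroes out the unselected ones, and both give the same result. All names (`top_k_mask`, `masked_moe`, `dense_moe`) are hypothetical, and weighting each selected expert by its raw gate score is an assumption made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def top_k_mask(gate_scores: List[float], k: int) -> List[int]:
    """Return a 0/1 mask marking the k experts with the highest gate scores."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(gate_scores))]


def matvec(weights: Matrix, x: List[float]) -> List[float]:
    """Plain matrix-vector product, standing in for one expert's computation."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def masked_moe(x: List[float], experts: List[Matrix],
               gate_scores: List[float], k: int) -> Tuple[List[float], List[int]]:
    """Evaluate only the experts selected by the mask (assumed gate weighting)."""
    mask = top_k_mask(gate_scores, k)
    out = [0.0] * len(experts[0])
    for selected, gate, weights in zip(mask, gate_scores, experts):
        if not selected:
            continue  # unselected experts are never computed
        y = matvec(weights, x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, mask


def dense_moe(x: List[float], experts: List[Matrix],
              gate_scores: List[float], k: int) -> List[float]:
    """Reference path: compute every expert, then zero the unselected ones."""
    mask = top_k_mask(gate_scores, k)
    out = [0.0] * len(experts[0])
    for selected, gate, weights in zip(mask, gate_scores, experts):
        y = matvec(weights, x)  # redundant work when selected == 0
        out = [o + selected * gate * yi for o, yi in zip(out, y)]
    return out


if __name__ == "__main__":
    x = [1.0, 2.0]
    experts = [
        [[3.0, 1.0], [1.0, 3.0]],
        [[1.0, 0.0], [0.0, 1.0]],
        [[2.0, 0.0], [0.0, 2.0]],
        [[0.0, 5.0], [5.0, 0.0]],
    ]
    gates = [0.1, 0.5, 0.3, 0.1]
    out, mask = masked_moe(x, experts, gates, k=2)
    print("mask:", mask)
    print("masked output:", out)
    print("dense output: ", dense_moe(x, experts, gates, k=2))
```

In this sketch the mask plays the role that the abstract assigns to CAM-based selection, and the per-expert `matvec` stands in for the CIM computation. The hardware fusion of the two steps into a single cycle is not modeled here.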