NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the memory bottleneck and the inefficiency of 3D NAND-based in-memory computing in supporting dynamic sparse expert activation when deploying Mixture-of-Experts (MoE) large language models on edge devices. To overcome these challenges, the authors propose NASiC, a novel architecture that uniquely integrates content-addressable memory (CAM) with multi-bit 3D NAND in-memory computing. NASiC leverages a CAM-based gating mechanism to enable dynamic expert selection and activation computation within a single cycle, while employing block-level parallelism and in-situ signed multi-bit expansion to significantly enhance parallelism and memory utilization. Experimental results demonstrate that, compared to existing approaches, NASiC achieves 4–114.8× higher performance and 3.9–70× better energy efficiency while maintaining high inference accuracy, offering a promising pathway for efficient edge deployment of MoE models.
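The idea of gating experts with a selection mask can be illustrated numerically. The sketch below is a software analogy only, not the NASiC circuit: it assumes a top-k router, represents the selection as a 0/1 mask, and skips every expert whose mask bit is 0. The function names (`top_k_mask`, `masked_moe`), the unweighted sum of expert outputs, and the toy matrices are all hypothetical choices made for illustration.

```python
def top_k_mask(gate_scores, k):
    """Return a 0/1 mask marking the k highest-scoring experts."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    selected = set(ranked[:k])
    return [1 if i in selected else 0 for i in range(len(gate_scores))]


def matvec(weights, x):
    """Plain matrix-vector product: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def masked_moe(experts, mask, x):
    """Sum the outputs of experts whose mask bit is 1.

    Returns the combined output and the number of experts evaluated,
    so the saved work is visible. Gate weighting is omitted for brevity.
    """
    out = [0] * len(experts[0])
    evaluated = 0
    for weights, bit in zip(experts, mask):
        if not bit:
            continue  # masked-off expert: no computation is issued
        evaluated += 1
        out = [o + y for o, y in zip(out, matvec(weights, x))]
    return out, evaluated


# Toy setup: 4 experts, each a 2x3 integer matrix (hypothetical values).
experts = [
    [[1, 0, 0], [0, 1, 0]],
    [[1, 1, 1], [0, 0, 1]],
    [[2, 2, 2], [2, 2, 2]],
    [[0, -1, 0], [1, 0, -1]],
]
mask = top_k_mask([0.10, 0.70, 0.05, 0.15], k=2)
output, evaluated = masked_moe(experts, mask, [1, 2, 3])
print(mask, output, evaluated)
```

Here only two of the four experts are evaluated. In software the mask is applied in a loop; the summary describes the hardware as performing selection and computation together in a single cycle.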

📝 Abstract

Mixture-of-Experts (MoE) models have emerged as the state-of-the-art paradigm for scaling up large language models (LLMs) without a proportional increase in computational cost. However, their on-device deployment faces a critical challenge: storing all expert parameters requires a large amount of memory. 3D NAND-based computing-in-memory (CIM) architectures offer high storage capacity and reduced data movement, but they are ill-suited to MoE models with dynamically sparse expert activation, which degrades effective computational parallelism and underutilizes the multibit storage capability of Flash cells. In this work, we propose NASiC, a 3D NAND-based content-addressable memory (CAM)-selected CIM architecture tailored to MoE models. By leveraging the intrinsic string structure of 3D NAND technology, NASiC fuses dynamic expert selection, through a CAM-based masking mechanism, and activated-expert computation, through CIM, into a single computation cycle, eliminating redundant computation and enhancing computational parallelism. Moreover, circuit-level optimizations and a multibit CIM cell are co-designed with the NASiC architecture, featuring block-wise parallel computation with in-situ signed multibit input and weight expansion. This substantially improves the throughput and energy efficiency of the NAND CIM array, as well as the utilization of high-density 3D NAND technology for MoE models. Extensive experimental results demonstrate that NASiC achieves 4-114.8x higher performance and 3.9-70x better energy efficiency than state-of-the-art designs while maintaining high accuracy, showing its potential for efficient on-device MoE LLM inference.
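The phrase "signed multibit expansion" can be made concrete with generic bit-slicing arithmetic. The abstract does not specify the encoding NASiC uses, so the sketch below assumes ordinary two's-complement bit planes, where the top bit carries a negative weight. It shows only that a signed dot product can be rebuilt from per-plane partial sums; all function names are hypothetical.

```python
def to_bit_planes(w, n_bits):
    """Two's-complement bits of signed integer w, least significant first."""
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    if not lo <= w <= hi:
        raise ValueError(f"{w} does not fit in {n_bits} signed bits")
    u = w & ((1 << n_bits) - 1)  # reinterpret as unsigned n-bit value
    return [(u >> i) & 1 for i in range(n_bits)]


def plane_weight(i, n_bits):
    """Significance of bit i; the sign bit contributes -2^(n_bits-1)."""
    return -(1 << i) if i == n_bits - 1 else (1 << i)


def bit_sliced_dot(weights, inputs, n_bits):
    """Signed dot product assembled from one binary partial sum per plane."""
    planes = [to_bit_planes(w, n_bits) for w in weights]
    total = 0
    for i in range(n_bits):
        # Each plane is a 0/1 vector, so this is a sum of selected inputs.
        partial = sum(p[i] * x for p, x in zip(planes, inputs))
        total += plane_weight(i, n_bits) * partial
    return total


weights = [-3, 5, -8, 7]   # 4-bit signed weights (hypothetical values)
inputs = [2, -1, 3, 4]
direct = sum(w * x for w, x in zip(weights, inputs))
print(bit_sliced_dot(weights, inputs, n_bits=4), direct)
```

Both values printed are equal, which confirms that the decomposition is exact for integers in range. A multibit cell would hold more than one bit per slice; single-bit planes are used here to keep the arithmetic easy to follow.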

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

3D NAND

Computing-in-Memory

On-Device Inference

Multibit Storage

Innovation

Methods, ideas, or system contributions that make the work stand out.

Computing-in-Memory

3D NAND Flash

Mixture-of-Experts