Accelerating LLM Inference with Flexible N:M Sparsity via A Fully Digital Compute-in-Memory Accelerator

📅 2025-04-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the bottleneck where fixed N:M sparsity compromises both model expressiveness and hardware efficiency in large language model (LLM) inference, this paper proposes FLOW, a layer-adaptive, outlier-density-aware dynamic N:M sparsity method, alongside FlexCiM, a fully digital compute-in-memory (CiM) architecture. Its key contributions are: (1) a layer-granular, variable N:M sparsity selection mechanism that jointly optimizes accuracy and efficiency via outlier distribution modeling and adaptive search; and (2) a reconfigurable digital CiM macro supporting dynamic sub-macro aggregation and disaggregation, coupled with a distribute-and-merge scheduling strategy that preserves sparsity flexibility while substantially reducing hardware overhead. Evaluated on Transformer and State Space Model (SSM) architectures, FLOW+FlexCiM achieves up to 36% higher accuracy, 1.75× lower inference latency, and 1.5× lower energy consumption compared to state-of-the-art sparse accelerators.

📝 Abstract
Large language model (LLM) pruning with fixed N:M structured sparsity significantly limits the expressivity of the sparse model, yielding sub-optimal performance. In contrast, supporting multiple N:M patterns to provide sparse representational freedom introduces costly overhead in hardware. To address these challenges for LLMs, we first present a flexible layer-wise outlier-density-aware N:M sparsity (FLOW) selection method. FLOW enables the identification of optimal layer-wise N and M values (from a given range) by simultaneously accounting for the presence and distribution of outliers, allowing a higher degree of representational freedom. To deploy sparse models with such N:M flexibility, we then introduce a flexible, low-overhead digital compute-in-memory architecture (FlexCiM). FlexCiM supports diverse sparsity patterns by partitioning a digital CiM (DCiM) macro into smaller sub-macros, which are adaptively aggregated and disaggregated through distribution and merging mechanisms for different N and M values. Extensive experiments on both transformer-based and recurrence-based state space foundation models (SSMs) demonstrate that FLOW outperforms existing alternatives with an accuracy improvement of up to 36%, while FlexCiM achieves up to 1.75x lower inference latency and 1.5x lower energy consumption compared to existing sparse accelerators. Code is available at: https://github.com/FLOW-open-project/FLOW
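FLOW's layer-wise (N, M) search and outlier-density modeling go beyond the scope of a short snippet, but the baseline operation it generalizes, N:M structured magnitude pruning, can be sketched in a few lines. The sketch below (illustrative, not the paper's implementation; `nm_prune` is a hypothetical helper) keeps the N largest-magnitude weights in each contiguous group of M and zeroes the rest; FLOW's contribution is choosing a different (N, M) pair per layer rather than fixing one pattern globally.

```python
import numpy as np

def nm_prune(weights, n, m):
    """Zero all but the n largest-magnitude weights in each group of m.

    `weights` is a 1-D array whose length is a multiple of m; the N:M
    pattern is applied along contiguous groups of m entries, as in
    standard N:M structured sparsity (e.g. 2:4).
    """
    w = np.asarray(weights, dtype=float)
    groups = w.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

# 2:4 sparsity keeps the two largest-magnitude weights per group of four.
w = np.array([0.9, -0.1, 0.05, -0.8, 0.2, 0.3, -0.7, 0.01])
print(nm_prune(w, n=2, m=4))
# → [ 0.9  0.   0.  -0.8  0.   0.3 -0.7  0. ]
```

A per-layer scheme in the spirit of FLOW would call this with a different (n, m) per weight matrix, selected by its outlier statistics, which is exactly the flexibility FlexCiM's sub-macro aggregation is built to serve in hardware.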
Problem

Research questions and friction points this paper is trying to address.

Flexible N:M sparsity for better LLM pruning
Low-overhead hardware for diverse sparsity patterns
Improving accuracy and efficiency in LLM inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flexible N:M sparsity selection method FLOW
Digital compute-in-memory architecture FlexCiM
Adaptive sub-macro aggregation for sparsity