Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs

📅 2025-10-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Sparse large language models (LLMs) suffer from low inference efficiency on compute-in-memory (CIM) architectures due to the memory wall in conventional von Neumann systems and the poor CIM array utilization caused by naïve sparse matrix mapping. Method: This paper proposes the first automated mapping and scheduling framework tailored to block-diagonal sparse structures. It jointly optimizes sparse model fine-tuning, block-diagonal sparsity pattern design, CIM array mapping, and computation scheduling. Contribution/Results: The framework achieves over 50% higher CIM array utilization, reduces memory footprint and floating-point operations by more than 4x, and enables full-model on-chip storage with efficient inference. Its core innovation is the deep co-design of structured sparsity with CIM's physical constraints, which resolves the critical mapping-mismatch bottleneck in sparse acceleration.

📝 Abstract
Structured sparsity enables deploying large language models (LLMs) on resource-constrained systems. Approaches like dense-to-sparse fine-tuning are particularly compelling, achieving remarkable structured sparsity by reducing the model size by over 6.7x while still maintaining acceptable accuracy. Despite this reduction, LLM inference, especially the inherently memory-bound decode stage, remains extremely expensive on conventional von Neumann architectures. Compute-in-memory (CIM) architectures mitigate this by performing computations directly in memory and, when paired with sparse LLMs, enable storing and computing the entire model in memory, eliminating data movement over the off-chip bus and improving efficiency. Nonetheless, naively mapping sparse matrices onto CIM arrays leads to poor array utilization and diminished computational efficiency. In this paper, we present an automated framework with novel mapping and scheduling strategies to accelerate sparse LLM inference on CIM accelerators. By exploiting block-diagonal sparsity, our approach improves CIM array utilization by over 50%, achieving more than a 4x reduction in both memory footprint and the number of required floating-point operations.
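To make the storage and FLOP savings concrete, here is a minimal sketch (not the paper's code; block count and sizes are assumptions) of why a block-diagonal weight matrix with B diagonal blocks needs only 1/B of the parameters and multiply-accumulates of its dense counterpart:

```python
import numpy as np

def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix, given as a list of dense
    diagonal blocks, by a vector, touching only the nonzero blocks."""
    out, offset = [], 0
    for blk in blocks:
        n = blk.shape[1]
        out.append(blk @ x[offset:offset + n])
        offset += n
    return np.concatenate(out)

rng = np.random.default_rng(0)
d, B = 16, 4                       # 16x16 matrix split into 4 diagonal blocks
blocks = [rng.standard_normal((d // B, d // B)) for _ in range(B)]
x = rng.standard_normal(d)

# Build the equivalent dense matrix to check correctness.
dense = np.zeros((d, d))
for i, blk in enumerate(blocks):
    s = i * (d // B)
    dense[s:s + d // B, s:s + d // B] = blk

assert np.allclose(block_diag_matvec(blocks, x), dense @ x)

# Stored parameters (and MACs) shrink by a factor of B:
dense_params = d * d
sparse_params = sum(b.size for b in blocks)
print(dense_params // sparse_params)   # 4, i.e. a 4x reduction with B = 4
```

With B = 4 equal blocks this gives exactly the 4x memory and FLOP reduction cited above; larger B or unequal blocks change the ratio accordingly.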
Problem

Research questions and friction points this paper is trying to address.

Accelerating sparse LLM inference on memory-constrained systems
Improving computational efficiency of sparse matrices on CIM arrays
Reducing memory footprint and operations via block-diagonal sparsity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated framework for sparse LLM inference acceleration
Mapping and scheduling strategies for CIM accelerators
Exploiting block-diagonal sparsity to improve array utilization
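The utilization gain from block-aware mapping can be sketched with a back-of-the-envelope calculation (tile and matrix sizes here are illustrative assumptions, not the paper's configuration): naively tiling the full sparse matrix programs many CIM arrays that hold only zeros, while mapping one dense diagonal block per array keeps every programmed cell useful.

```python
def cim_utilization(rows_used, cols_used, tile_rows, tile_cols):
    """Fraction of cells in one CIM tile that hold nonzero weights."""
    return (rows_used * cols_used) / (tile_rows * tile_cols)

tile_rows, tile_cols = 64, 64       # assumed CIM array (tile) dimensions
d, B = 256, 4                       # 256x256 weights, four 64x64 diagonal blocks

# Naive mapping: tile the full 256x256 matrix -> 16 tiles, but only the
# 4 tiles on the diagonal contain any nonzero weights.
naive_tiles = (d // tile_rows) * (d // tile_cols)
naive_util = B / naive_tiles        # 4 / 16 = 0.25

# Block-aware mapping: place one dense 64x64 block per tile -> 4 tiles,
# each fully utilized.
aware_util = cim_utilization(64, 64, tile_rows, tile_cols)

print(naive_util, aware_util)       # 0.25 1.0
```

In this toy setting, block-aware mapping lifts per-tile utilization from 25% to 100%, consistent in spirit with the paper's reported >50% utilization improvement (the exact figure depends on block, tile, and model dimensions).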