FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

2026-04-28
Citations: 0
Influential: 0
AI Summary
This work addresses the computational intensity, low energy efficiency, and insufficient on-chip data reuse of large language model (LLM) inference by proposing a fusion-driven compute-in-memory (CIM) architecture. It co-designs the attention mechanism by fusing the QKᵀ and PV computations, mapping QKᵀ onto inner-product-based CIM (IP-CIM) and PV aggregation onto outer-product-based CIM (OP-CIM), and introduces a QO-stationary dataflow to maximize on-chip data reuse. A pattern-aware online-softmax mechanism reduces the overhead of the nonlinear operations. Experimental evaluation on the LLaMA-3 model shows up to 1.98× speedup and 3.86× energy savings compared with prior state-of-the-art CIM-based designs, with a system-level energy efficiency of 29.4 TOPS/W.
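The two CIM styles named in the summary correspond to two standard ways of ordering a matrix multiplication. The sketch below is only a software illustration of that distinction and says nothing about the actual hardware mapping; the function names `matmul_inner` and `matmul_outer` are hypothetical and do not come from the paper.

```python
def matmul_inner(a, b):
    """Inner-product form: each output element is one dot product
    of a row of `a` with a column of `b`."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][t] * b[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_outer(a, b):
    """Outer-product form: the result is accumulated as a sum of rank-1
    updates, one per index `t` of the shared dimension."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for t in range(k):              # one column of `a` and one row of `b` per step
        for i in range(n):
            for j in range(m):
                out[i][j] += a[i][t] * b[t][j]
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result_inner = matmul_inner(a, b)
result_outer = matmul_outer(a, b)
```

Both orderings produce the same product. One plausible reading of the pairing, offered here as an assumption, is that the outer-product form lets PV aggregation consume score columns as soon as they are produced, without waiting for the full score matrix.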
Abstract
In this paper, we propose FusionCIM, an operator-fusion-driven compute-in-memory (CIM) accelerator architecture for efficient and scalable LLM inference, with three key innovations: (1) a hybrid CIM pipeline architecture that maps QKᵀ computation onto inner-product-based CIM (IP-CIM) and PV aggregation onto outer-product-based CIM (OP-CIM) for efficient matrix-multiplication fusion; (2) a QO-stationary dataflow that eliminates repeated KV loading in CIM and K-matrix access in the buffer under transpose fusion, significantly improving on-chip data reuse; and (3) a pattern-aware online-softmax mechanism that exploits distribution regularities of attention scores to reduce exponential rescaling overhead for non-linear fusion. Experimental results on the LLaMA-3 model show that FusionCIM achieves up to 3.86× energy savings and 1.98× speedup compared with prior SOTA CIM-based designs, with 29.4 TOPS/W energy efficiency at the system level.
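To make the "exponential rescaling overhead" concrete, here is a minimal sketch of generic online softmax fused with PV aggregation for a single query vector. It is the textbook streaming formulation, not the paper's pattern-aware mechanism; the names `attention_reference` and `attention_online`, the `block` parameter, and the rescale counter are all assumptions introduced for illustration.

```python
import math


def attention_reference(q, keys, values):
    """Two-pass softmax attention for one query: compute all scores, then normalize."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(values[0])
    return [sum(w[t] * values[t][j] for t in range(len(keys))) / z for j in range(d)]


def attention_online(q, keys, values, block=2):
    """Streaming attention: keys/values arrive in blocks and the output is
    rescaled only when the running maximum score increases."""
    d = len(values[0])
    m = -math.inf          # running maximum score
    z = 0.0                # running softmax denominator
    acc = [0.0] * d        # running unnormalized output
    rescales = 0           # how many times the accumulators were rescaled
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in kb]
        m_new = max(m, max(scores))
        if m_new > m:
            scale = math.exp(m - m_new)   # exp(-inf) == 0.0 on the first block
            z *= scale
            acc = [a * scale for a in acc]
            rescales += 1
            m = m_new
        for s, v in zip(scores, vb):
            w = math.exp(s - m)
            z += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
    return [a / z for a in acc], rescales


q = [1.0, 0.0]
keys = [[3.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
expected = attention_reference(q, keys, values)
streamed, rescales = attention_online(q, keys, values, block=2)
```

In this toy input the largest score sits in the first block, so the accumulators are rescaled only once. When large scores arrive late, every new maximum forces another exponential and a multiply over the whole accumulator, which is the cost the abstract says the pattern-aware mechanism reduces.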
Problem

Research questions and friction points this paper is trying to address.

LLM inference
compute-in-memory
energy efficiency
matrix multiplication
attention mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compute-in-Memory
Operator Fusion
LLM Inference
Dataflow Optimization
Online Softmax