FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

📅 2026-04-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets three bottlenecks of large language model (LLM) inference: high computational intensity, low energy efficiency, and insufficient on-chip data reuse. It proposes FusionCIM, a fusion-driven compute-in-memory (CIM) architecture that fuses the QKᵀ and PV computations of the attention mechanism. QKᵀ is mapped to inner-product-based CIM (IP-CIM) and PV aggregation to outer-product-based CIM (OP-CIM), and a QO-stationary dataflow maximizes on-chip data reuse. A pattern-aware online softmax mechanism reduces the overhead of the nonlinear operations. On the LLaMA-3 model, the architecture achieves up to 1.98× speedup and 3.86× energy savings over prior state-of-the-art CIM-based designs, with a system-level energy efficiency of 29.4 TOPS/W.
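
To make the "rescaling overhead" of online softmax concrete, here is a minimal Python sketch of the standard online (streaming) softmax. It is not the paper's pattern-aware variant, whose details are not given in this summary. The function name `online_softmax` and the `rescales` counter are illustrative assumptions. The counter shows that an exponential rescale is only needed when the running maximum changes, so the cost depends on how the scores are distributed.

```python
import math

def online_softmax(scores):
    """Standard online softmax over a stream of scores (illustrative only).

    Returns the probabilities and the number of times the running sum had
    to be rescaled because a new maximum appeared.
    """
    running_max = float("-inf")
    running_sum = 0.0
    rescales = 0  # hypothetical counter, added here for illustration
    for s in scores:
        if s > running_max:
            if running_sum > 0.0:
                # A new maximum forces the accumulated sum to be rescaled.
                running_sum *= math.exp(running_max - s)
                rescales += 1
            running_max = s
        running_sum += math.exp(s - running_max)
    probs = [math.exp(s - running_max) / running_sum for s in scores]
    return probs, rescales

probs, rescales = online_softmax([1.0, 3.0, 2.0, 0.5])
print(rescales)  # one rescale: the maximum moved from 1.0 to 3.0
```

If the largest score arrives first, no rescale happens at all, which suggests why regularities in the attention-score distribution can be exploited to cut this cost.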

📝 Abstract

In this paper, we propose FusionCIM, an operator-fusion-driven compute-in-memory (CIM) accelerator architecture for efficient and scalable LLM inference, with three key innovations: (1) a hybrid CIM pipeline architecture that maps QKᵀ computation onto inner-product-based CIM (IP-CIM) and PV aggregation onto outer-product-based CIM (OP-CIM) for efficient matrix-multiplication fusion; (2) a QO-stationary dataflow that eliminates repeated KV loading in CIM and K-matrix access in the buffer under transpose fusion, significantly improving on-chip data reuse; and (3) a pattern-aware online-softmax mechanism that exploits distribution regularities of attention scores to reduce exponential rescaling overhead for non-linear fusion. Experimental results on the LLaMA-3 model show that FusionCIM achieves up to 3.86× energy savings and 1.98× speedup compared with prior SOTA CIM-based designs, with 29.4 TOPS/W energy efficiency at the system level.
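
The fusion described above can be sketched functionally in plain Python. This is a software model under stated assumptions, not the authors' hardware dataflow: it does not model CIM arrays, buffers, or the pattern-aware softmax. The name `fused_attention` and the loop order are illustrative choices. Each score is an inner product of a query row with a key row, each output update is one term of the outer product between a column of probabilities and a value row, and the query matrix and output accumulator stay resident while every K/V row is read exactly once.

```python
import math

def fused_attention(Q, K, V):
    """Fused QK^T -> online softmax -> PV; Q and O stay resident (a sketch)."""
    n_q, d, d_v = len(Q), len(Q[0]), len(V[0])
    scale = 1.0 / math.sqrt(d)
    row_max = [float("-inf")] * n_q
    row_sum = [0.0] * n_q
    O = [[0.0] * d_v for _ in range(n_q)]  # output accumulator
    for k_row, v_row in zip(K, V):  # each K/V row is streamed once
        for i, q_row in enumerate(Q):
            # Inner product: one attention score.
            s = scale * sum(a * b for a, b in zip(q_row, k_row))
            new_max = max(row_max[i], s)
            alpha = math.exp(row_max[i] - new_max)  # rescale factor
            p = math.exp(s - new_max)
            row_sum[i] = row_sum[i] * alpha + p
            # Rank-1 (outer-product) update of output row i.
            O[i] = [o * alpha + p * v for o, v in zip(O[i], v_row)]
            row_max[i] = new_max
    return [[o / row_sum[i] for o in O[i]] for i in range(n_q)]

def reference_attention(Q, K, V):
    """Unfused baseline: materialize scores, softmax, then multiply by V."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(s)
        e = [math.exp(x - m) for x in s]
        z = sum(e)
        out.append([sum(e[j] / z * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out
```

The fused version never stores the full score matrix or a transposed copy of K, which is the kind of intermediate traffic operator fusion aims to avoid. Both functions produce the same result up to floating-point rounding.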

Problem

Research questions and friction points this paper is trying to address.

LLM inference

compute-in-memory

energy efficiency

matrix multiplication

attention mechanism

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compute-in-Memory

Operator Fusion

LLM Inference